[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-docarray--docarray":3,"tool-docarray--docarray":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",156804,2,"2026-04-15T11:34:33",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":64,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":77,"owner_twitter":64,"owner_website":78,"owner_url":79,"languages":80,"stars":96,"forks":97,"last_commit_at":98,"license":99,"difficulty_score":100,"env_os":101,"env_gpu":102,"env_ram":101,"env_deps":103,"category_tags":111,"github_topics":113,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":131,"updated_at":132,"faqs":133,"releases":169},7842,"docarray\u002Fdocarray","docarray","Represent, send, store and search multimodal data","DocArray 是一个专为多模态数据设计的 Python 库，旨在帮助开发者轻松表示、传输、存储和检索图像、文本、音频等复杂非结构化数据。在构建多模态 AI 应用时，处理不同类型数据的混合结构往往令人头疼，DocArray 通过提供统一且灵活的数据结构，有效解决了这一痛点，让数据在不同系统间的流转更加顺畅。\n\n这款工具特别适合 AI 工程师、研究人员以及后端开发者使用，尤其是那些正在利用 PyTorch、TensorFlow 或 JAX 进行模型训练，或需要构建基于向量数据库检索系统的团队。DocArray 的独特之处在于其强大的生态兼容性：它原生支持主流深度学习框架，能无缝对接 NumPy 数组；基于 Pydantic 构建，可快速集成到 FastAPI 等 Web 服务中；同时内置了对 Weaviate、Qdrant、Redis 等多种向量数据库的支持。此外，它还支持通过 HTTP (JSON) 或 gRPC 
(Protobuf) 高效传输数据。作为 LF AI & Data 基金会下的开源项目，DocArray 以 Apache 2.0 协议开放，是连接数据处理与多模态 AI ","DocArray 是一个专为多模态数据设计的 Python 库，旨在帮助开发者轻松表示、传输、存储和检索图像、文本、音频等复杂非结构化数据。在构建多模态 AI 应用时，处理不同类型数据的混合结构往往令人头疼，DocArray 通过提供统一且灵活的数据结构，有效解决了这一痛点，让数据在不同系统间的流转更加顺畅。\n\n这款工具特别适合 AI 工程师、研究人员以及后端开发者使用，尤其是那些正在利用 PyTorch、TensorFlow 或 JAX 进行模型训练，或需要构建基于向量数据库检索系统的团队。DocArray 的独特之处在于其强大的生态兼容性：它原生支持主流深度学习框架，能无缝对接 NumPy 数组；基于 Pydantic 构建，可快速集成到 FastAPI 等 Web 服务中；同时内置了对 Weaviate、Qdrant、Redis 等多种向量数据库的支持。此外，它还支持通过 HTTP (JSON) 或 gRPC (Protobuf) 高效传输数据。作为 LF AI & Data 基金会下的开源项目，DocArray 以 Apache 2.0 协议开放，是连接数据处理与多模态 AI 应用的理想桥梁。","\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fblob\u002Fmain\u002Fdocs\u002Fassets\u002Flogo-dark.svg?raw=true\" alt=\"DocArray logo: The data structure for unstructured data\" width=\"150px\">\n\u003Cbr>\n\u003Cb>The data structure for multimodal data\u003C\u002Fb>\n\u003C\u002Fp>\n\n\u003Cp align=center>\n\u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fdocarray\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fdocarray?style=flat-square&amp;label=Release\" alt=\"PyPI\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fbestpractices.coreinfrastructure.org\u002Fprojects\u002F6554\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fdocarray_docarray_readme_43b29938ccd7.png\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fcodecov.io\u002Fgh\u002Fdocarray\u002Fdocarray\">\u003Cimg alt=\"Codecov branch\" src=\"https:\u002F\u002Fimg.shields.io\u002Fcodecov\u002Fc\u002Fgithub\u002Fdocarray\u002Fdocarray\u002Fmain?&logo=Codecov&logoColor=white&style=flat-square\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fpypistats.org\u002Fpackages\u002Fdocarray\">\u003Cimg alt=\"PyPI - Downloads from official pypistats\" src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fdm\u002Fdocarray?style=flat-square\">\u003C\u002Fa>\n\u003Ca 
href=\"https:\u002F\u002Fdiscord.gg\u002FWaMp6PVPgR\">\u003Cimg src=\"https:\u002F\u002Fdcbadge.vercel.app\u002Fapi\u002Fserver\u002FWaMp6PVPgR?theme=default-inverted&style=flat-square\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n> **Note**\n> The README you're currently viewing is for DocArray>0.30, which introduces some significant changes from DocArray 0.21. If you wish to continue using the older DocArray \u003C=0.21, ensure you install it via `pip install docarray==0.21`. Refer to its [codebase](https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Ftree\u002Fv0.21.0), [documentation](https:\u002F\u002Fdocarray.jina.ai), and [its hot-fixes branch](https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Ftree\u002Fdocarray-v1-fixes) for more information.\n\n\nDocArray is a Python library expertly crafted for the [representation](#represent), [transmission](#send), [storage](#store), and [retrieval](#retrieve) of multimodal data. Tailored for the development of multimodal AI applications, its design guarantees seamless integration with the extensive Python and machine learning ecosystems. 
As of January 2022, DocArray is openly distributed under the [Apache License 2.0](https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fblob\u002Fmain\u002FLICENSE.md) and currently enjoys the status of a sandbox project within the [LF AI & Data Foundation](https:\u002F\u002Flfaidata.foundation\u002F).\n\n\n\n- :fire: Offers native support for **[NumPy](https:\u002F\u002Fgithub.com\u002Fnumpy\u002Fnumpy)**, **[PyTorch](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch)**, **[TensorFlow](https:\u002F\u002Ftensorflow\u002Ftensorflow)**, and **[JAX](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fjax)**, catering specifically to **model training scenarios**.\n- :zap: Based on **[Pydantic](https:\u002F\u002Fgithub.com\u002Fpydantic\u002Fpydantic)**, and instantly compatible with web and microservice frameworks like **[FastAPI](https:\u002F\u002Fgithub.com\u002Ftiangolo\u002Ffastapi\u002F)** and **[Jina](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fjina\u002F)**.\n- :package: Provides support for vector databases such as **[Weaviate](https:\u002F\u002Fweaviate.io\u002F)**, **[Qdrant](https:\u002F\u002Fqdrant.tech\u002F)**, **[ElasticSearch](https:\u002F\u002Fwww.elastic.co\u002Fde\u002Felasticsearch\u002F)**, **[Redis](https:\u002F\u002Fredis.io\u002F)**, **[Mongo Atlas](https:\u002F\u002Fwww.mongodb.com\u002F)**, and **[HNSWLib](https:\u002F\u002Fgithub.com\u002Fnmslib\u002Fhnswlib)**.\n- :chains: Allows data transmission as JSON over **HTTP** or as **[Protobuf](https:\u002F\u002Fprotobuf.dev\u002F)** over **[gRPC](https:\u002F\u002Fgrpc.io\u002F)**.\n\n## Installation\n\nTo install DocArray from the CLI, run the following command:\n\n```shell\npip install -U docarray\n```\n\n> **Note**\n> To use DocArray \u003C=0.21, make sure you install via `pip install docarray==0.21` and check out its [codebase](https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Ftree\u002Fv0.21.0) and [docs](https:\u002F\u002Fdocarray.jina.ai) and [its hot-fixes 
branch](https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Ftree\u002Fdocarray-v1-fixes).\n\n## Get Started\nNew to DocArray? Depending on your use case and background, there are multiple ways to learn about DocArray:\n \n- [Coming from pure PyTorch or TensorFlow](#coming-from-pytorch)\n- [Coming from Pydantic](#coming-from-pydantic)\n- [Coming from FastAPI](#coming-from-fastapi)\n- [Coming from Jina](#coming-from-jina)\n- [Coming from a vector database](#coming-from-a-vector-database)\n- [Coming from Langchain](#coming-from-langchain)\n\n\n## Represent\n\nDocArray empowers you to **represent your data** in a manner that is inherently attuned to machine learning.\n\nThis is particularly beneficial for various scenarios:\n\n- :running: You are **training a model**: You're dealing with tensors of varying shapes and sizes, each signifying different elements. You desire a method to logically organize them.\n- :cloud: You are **serving a model**: Let's say through FastAPI, and you wish to define your API endpoints precisely.\n- :card_index_dividers: You are **parsing data**: Perhaps for future deployment in your machine learning or data science projects.\n\n> :bulb: **Familiar with Pydantic?** You'll be pleased to learn\n> that DocArray is not only constructed atop Pydantic but also maintains complete compatibility with it!\n> Furthermore, we have a [specific section](#coming-from-pydantic) dedicated to your needs!\n\nIn essence, DocArray facilitates data representation in a way that mirrors Python dataclasses, with machine learning being an integral component:\n\n\n```python\nfrom docarray import BaseDoc\nfrom docarray.typing import TorchTensor, ImageUrl\nimport torch\n\n\n# Define your data model\nclass MyDocument(BaseDoc):\n    description: str\n    image_url: ImageUrl  # could also be VideoUrl, AudioUrl, etc.\n    image_tensor: TorchTensor[1704, 2272, 3]  # you can express tensor shapes!\n\n\n# Stack multiple documents in a Document Vector\nfrom docarray 
import DocVec\n\nvec = DocVec[MyDocument](\n    [\n        MyDocument(\n            description=\"A cat\",\n            image_url=\"https:\u002F\u002Fexample.com\u002Fcat.jpg\",\n            image_tensor=torch.rand(1704, 2272, 3),\n        ),\n    ]\n    * 10\n)\nprint(vec.image_tensor.shape)  # (10, 1704, 2272, 3)\n```\n\n\u003Cdetails markdown=\"1\">\n  \u003Csummary>Click for more details\u003C\u002Fsummary>\n\nLet's take a closer look at how you can represent your data with DocArray:\n\n```python\nfrom docarray import BaseDoc\nfrom docarray.typing import TorchTensor, ImageUrl\nfrom typing import Optional\nimport torch\n\n\n# Define your data model\nclass MyDocument(BaseDoc):\n    description: str\n    image_url: ImageUrl  # could also be VideoUrl, AudioUrl, etc.\n    image_tensor: Optional[\n        TorchTensor[1704, 2272, 3]\n    ] = None  # could also be NdArray or TensorflowTensor\n    embedding: Optional[TorchTensor] = None\n```\n\nSo not only can you define the types of your data, you can even **specify the shape of your tensors!**\n\n```python\n# Create a document\ndoc = MyDocument(\n    description=\"This is a photo of a mountain\",\n    image_url=\"https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002F2\u002F2f\u002FAlpamayo.jpg\",\n)\n\n# Load image tensor from URL\ndoc.image_tensor = doc.image_url.load()\n\n\n# Compute embedding with any model of your choice\ndef clip_image_encoder(image_tensor: TorchTensor) -> TorchTensor:  # dummy function\n    return torch.rand(512)\n\n\ndoc.embedding = clip_image_encoder(doc.image_tensor)\n\nprint(doc.embedding.shape)  # torch.Size([512])\n```\n\n### Compose nested Documents\n\nOf course, you can compose Documents into a nested structure:\n\n```python\nfrom docarray import BaseDoc\nfrom docarray.documents import ImageDoc, TextDoc\nimport numpy as np\n\n\nclass MultiModalDocument(BaseDoc):\n    image_doc: ImageDoc\n    text_doc: TextDoc\n\n\ndoc = MultiModalDocument(\n    
image_doc=ImageDoc(tensor=np.zeros((3, 224, 224))), text_doc=TextDoc(text='hi!')\n)\n```\n\nYou rarely work with a single data point at a time, especially in machine learning applications. That's why you can easily collect multiple `Documents`:\n\n### Collect multiple `Documents`\n\nWhen building or interacting with an ML system, usually you want to process multiple Documents (data points) at once.\n\nDocArray offers two data structures for this:\n\n- **`DocVec`**: A vector of `Documents`. All tensors in the documents are stacked into a single tensor. **Perfect for batch processing and use inside of ML models**.\n- **`DocList`**: A list of `Documents`. All tensors in the documents are kept as-is. **Perfect for streaming, re-ranking, and shuffling of data**.\n\nLet's take a look at them, starting with `DocVec`:\n\n```python\nfrom docarray import DocVec, BaseDoc\nfrom docarray.typing import AnyTensor, ImageUrl\nimport numpy as np\n\n\nclass Image(BaseDoc):\n    url: ImageUrl\n    tensor: AnyTensor  # this allows torch, numpy, and TensorFlow tensors\n\n\nvec = DocVec[Image](  # the DocVec is parametrized by your personal schema!\n    [\n        Image(\n            url=\"https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002F2\u002F2f\u002FAlpamayo.jpg\",\n            tensor=np.zeros((3, 224, 224)),\n        )\n        for _ in range(100)\n    ]\n)\n```\n\nIn the code snippet above, `DocVec` is **parametrized by the type of document** you want to use with it: `DocVec[Image]`.\n\nThis may look weird at first, but we're confident that you'll get used to it quickly!\nBesides, it lets us do some cool things, like having **bulk access to the fields that you defined** in your document:\n\n```python\ntensor = vec.tensor  # gets all the tensors in the DocVec\nprint(tensor.shape)  # which are stacked up into a single tensor!\nprint(vec.url)  # you can bulk access any other field, too\n```\n\nThe second data structure, `DocList`, works in a similar 
way:\n\n```python\nfrom docarray import DocList\n\ndl = DocList[Image](  # the DocList is parametrized by your personal schema!\n    [\n        Image(\n            url=\"https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002F2\u002F2f\u002FAlpamayo.jpg\",\n            tensor=np.zeros((3, 224, 224)),\n        )\n        for _ in range(100)\n    ]\n)\n```\n\nYou can still bulk access the fields of your document:\n\n```python\ntensors = dl.tensor  # gets all the tensors in the DocList\nprint(type(tensors))  # as a list of tensors\nprint(dl.url)  # you can bulk access any other field, too\n```\n\nAnd you can insert, remove, and append documents to your `DocList`:\n\n```python\n# append\ndl.append(\n    Image(\n        url=\"https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002F2\u002F2f\u002FAlpamayo.jpg\",\n        tensor=np.zeros((3, 224, 224)),\n    )\n)\n# delete\ndel dl[0]\n# insert\ndl.insert(\n    0,\n    Image(\n        url=\"https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002F2\u002F2f\u002FAlpamayo.jpg\",\n        tensor=np.zeros((3, 224, 224)),\n    ),\n)\n```\n\nAnd you can seamlessly switch between `DocVec` and `DocList`:\n\n```python\nvec_2 = dl.to_doc_vec()\nassert isinstance(vec_2, DocVec)\n\ndl_2 = vec_2.to_doc_list()\nassert isinstance(dl_2, DocList)\n```\n\n\u003C\u002Fdetails>\n\n## Send\n\nDocArray facilitates the **transmission of your data** in a manner inherently compatible with machine learning.\n\nThis includes native support for **Protobuf and gRPC**, along with **HTTP** and serialization to JSON, JSONSchema, Base64, and Bytes.\n\nThis feature proves beneficial for several scenarios:\n\n- :cloud: You are **serving a model**, perhaps through frameworks like **[Jina](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fjina\u002F)** or **[FastAPI](https:\u002F\u002Fgithub.com\u002Ftiangolo\u002Ffastapi\u002F)**\n- :spider_web: You are **distributing your model** across multiple machines and need 
an efficient means of transmitting your data between nodes\n- :gear: You are architecting a **microservice** environment and require a method for data transmission between microservices\n\n> :bulb: **Are you familiar with FastAPI?** You'll be delighted to learn\n> that DocArray maintains full compatibility with FastAPI!\n> Plus, we have a [dedicated section](#coming-from-fastapi) specifically for you!\n\nWhen it comes to data transmission, serialization is a crucial step. Let's delve into how DocArray streamlines this process:\n\n\n```python\nfrom docarray import BaseDoc\nfrom docarray.typing import ImageTorchTensor\nimport torch\n\n\n# model your data\nclass MyDocument(BaseDoc):\n    description: str\n    image: ImageTorchTensor[3, 224, 224]\n\n\n# create a Document\ndoc = MyDocument(\n    description=\"This is a description\",\n    image=torch.zeros((3, 224, 224)),\n)\n\n# serialize it!\nproto = doc.to_protobuf()\nbytes_ = doc.to_bytes()\njson = doc.json()\n\n# deserialize it!\ndoc_2 = MyDocument.from_protobuf(proto)\ndoc_4 = MyDocument.from_bytes(bytes_)\ndoc_5 = MyDocument.parse_raw(json)\n```\n\nOf course, serialization is not all you need. So check out how DocArray integrates with **[Jina](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fjina\u002F)** and **[FastAPI](https:\u002F\u002Fgithub.com\u002Ftiangolo\u002Ffastapi\u002F)**.\n\n## Store\n\nAfter modeling and possibly distributing your data, you'll typically want to **store it** somewhere. That's where DocArray steps in!\n\n**Document Stores** provide a seamless way to, as the name suggests, store your Documents. 
Be it locally or remotely, you can do it all through the same user interface:\n\n- :cd: **On disk**, as a file in your local filesystem\n- :bucket: On **[AWS S3](https:\u002F\u002Faws.amazon.com\u002Fde\u002Fs3\u002F)**\n- :cloud: On **[Jina AI Cloud](https:\u002F\u002Fcloud.jina.ai\u002F)**\n\nThe Document Store interface lets you push and pull Documents to and from multiple data sources, all with the same user interface.\n\nFor example, let's see how that works with on-disk storage:\n\n```python\nfrom docarray import BaseDoc, DocList\n\n\nclass SimpleDoc(BaseDoc):\n    text: str\n\n\ndocs = DocList[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(8)])\ndocs.push('file:\u002F\u002Fsimple_docs')\n\ndocs_pull = DocList[SimpleDoc].pull('file:\u002F\u002Fsimple_docs')\n```\n\n## Retrieve\n\n**Document Indexes** let you index your Documents in a **vector database** for efficient similarity-based retrieval.\n\nThis is useful for:\n\n- :left_speech_bubble: Augmenting **LLMs and Chatbots** with domain knowledge ([Retrieval Augmented Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.11401))\n- :mag: **Neural search** applications\n- :bulb: **Recommender systems**\n\nCurrently, Document Indexes support **[Weaviate](https:\u002F\u002Fweaviate.io\u002F)**, **[Qdrant](https:\u002F\u002Fqdrant.tech\u002F)**, **[ElasticSearch](https:\u002F\u002Fwww.elastic.co\u002F)**,  **[Redis](https:\u002F\u002Fredis.io\u002F)**, **[Mongo Atlas](https:\u002F\u002Fwww.mongodb.com\u002F)**, and **[HNSWLib](https:\u002F\u002Fgithub.com\u002Fnmslib\u002Fhnswlib)**, with more to come!\n\nThe Document Index interface lets you index and retrieve Documents from multiple vector databases, all with the same user interface.\n\nIt supports ANN vector search, text search, filtering, and hybrid search.\n\n```python\nfrom docarray import DocList, BaseDoc\nfrom docarray.index import HnswDocumentIndex\nimport numpy as np\n\nfrom docarray.typing import ImageUrl, ImageTensor, NdArray\n\n\nclass 
ImageDoc(BaseDoc):\n    url: ImageUrl\n    tensor: ImageTensor\n    embedding: NdArray[128]\n\n\n# create some data\ndl = DocList[ImageDoc](\n    [\n        ImageDoc(\n            url=\"https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002F2\u002F2f\u002FAlpamayo.jpg\",\n            tensor=np.zeros((3, 224, 224)),\n            embedding=np.random.random((128,)),\n        )\n        for _ in range(100)\n    ]\n)\n\n# create a Document Index\nindex = HnswDocumentIndex[ImageDoc](work_dir='\u002Ftmp\u002Ftest_index')\n\n\n# index your data\nindex.index(dl)\n\n# find similar Documents\nquery = dl[0]\nresults, scores = index.find(query, limit=10, search_field='embedding')\n```\n\n---\n\n## Learn DocArray\n\nDepending on your background and use case, there are different ways for you to understand DocArray.\n\n### Coming from DocArray \u003C=0.21\n\n\u003Cdetails markdown=\"1\">\n  \u003Csummary>Click to expand\u003C\u002Fsummary>\n\nIf you are using a DocArray version below 0.30.0, you will be familiar with its [dataclass API](https:\u002F\u002Fdocarray.jina.ai\u002Ffundamentals\u002Fdataclass\u002F).\n\n_DocArray >=0.30 is that idea, taken seriously._ Every document is created through a dataclass-like interface,\ncourtesy of [Pydantic](https:\u002F\u002Fpydantic-docs.helpmanual.io\u002Fusage\u002Fmodels\u002F).\n\nThis gives the following advantages:\n- **Flexibility:** No need to conform to a fixed set of fields -- your data defines the schema\n- **Multimodality:** At their core, documents are just dictionaries. 
This makes it easy to create and send them from any language, not just Python.\n\nYou may also be familiar with our old Document Stores for vector DB integration.\nThey are now called **Document Indexes** and offer the following improvements (see [here](#store) for the new API):\n\n- **Hybrid search:** You can now combine vector search with text search, and even filter by arbitrary fields\n- **Production-ready:** The new Document Indexes are a much thinner wrapper around the various vector DB libraries, making them more robust and easier to maintain\n- **Increased flexibility:** We strive to support any configuration or setting that you could perform through the DB's first-party client\n\nFor now, Document Indexes support **[Weaviate](https:\u002F\u002Fweaviate.io\u002F)**, **[Qdrant](https:\u002F\u002Fqdrant.tech\u002F)**, **[ElasticSearch](https:\u002F\u002Fwww.elastic.co\u002F)**, **[Redis](https:\u002F\u002Fredis.io\u002F)**, **[Mongo Atlas](https:\u002F\u002Fwww.mongodb.com\u002F)**, Exact Nearest Neighbour search and **[HNSWLib](https:\u002F\u002Fgithub.com\u002Fnmslib\u002Fhnswlib)**, with more to come.\n\n\u003C\u002Fdetails>\n\n### Coming from Pydantic\n\n\u003Cdetails markdown=\"1\">\n  \u003Csummary>Click to expand\u003C\u002Fsummary>\n\nIf you come from Pydantic, you can see DocArray documents as juiced up Pydantic models, and DocArray as a collection of goodies around them.\n\nMore specifically, we set out to **make Pydantic fit for the ML world** - not by replacing it, but by building on top of it!\n\nThis means you get the following benefits:\n\n- **ML-focused types**: Tensor, TorchTensor, Embedding, ..., including **tensor shape validation**\n- Full compatibility with **FastAPI**\n- **DocList** and **DocVec** generalize the idea of a model to a _sequence_ or _batch_ of models. 
Perfect for **use in ML models** and other batch processing tasks.\n- **Types that are alive**: ImageUrl can `.load()` a URL to image tensor, TextUrl can load and tokenize text documents, etc.\n- Cloud-ready: Serialization to **Protobuf** for use with microservices and **gRPC**\n- **Pre-built multimodal documents** for different data modalities: Image, Text, 3DMesh, Video, Audio and more. Note that all of these are valid Pydantic models!\n- **Document Stores** and **Document Indexes** let you store your data and retrieve it using **vector search**\n\nThe most obvious advantage here is **first-class support for ML-centric data**, such as `{Torch, TF, ...}Tensor`, `Embedding`, etc.\n\nThis includes handy features such as validating the shape of a tensor:\n\n```python\nfrom docarray import BaseDoc\nfrom docarray.typing import TorchTensor\nimport torch\n\n\nclass MyDoc(BaseDoc):\n    tensor: TorchTensor[3, 224, 224]\n\n\ndoc = MyDoc(tensor=torch.zeros(3, 224, 224))  # works\ndoc = MyDoc(tensor=torch.zeros(224, 224, 3))  # works by reshaping\n\ntry:\n    doc = MyDoc(tensor=torch.zeros(224))  # fails validation\nexcept Exception as e:\n    print(e)\n    # tensor\n    # Cannot reshape tensor of shape (224,) to shape (3, 224, 224) (type=value_error)\n\n\nclass Image(BaseDoc):\n    tensor: TorchTensor[3, 'x', 'x']\n\n\nImage(tensor=torch.zeros(3, 224, 224))  # works\n\ntry:\n    Image(\n        tensor=torch.zeros(3, 64, 128)\n    )  # fails validation because second dimension does not match third\nexcept Exception as e:\n    print(e)\n\n\ntry:\n    Image(\n        tensor=torch.zeros(4, 224, 224)\n    )  # fails validation because of the first dimension\nexcept Exception as e:\n    print(e)\n    # Tensor shape mismatch. Expected(3, 'x', 'x'), got(4, 224, 224)(type=value_error)\n\ntry:\n    Image(\n        tensor=torch.zeros(3, 64)\n    )  # fails validation because it does not have enough dimensions\nexcept Exception as e:\n    print(e)\n    # Tensor shape mismatch. 
Expected (3, 'x', 'x'), got (3, 64) (type=value_error)\n```\n\n\u003C\u002Fdetails>\n\n### Coming from PyTorch\n\n\u003Cdetails markdown=\"1\">\n  \u003Csummary>Click to expand\u003C\u002Fsummary>\n\nIf you come from PyTorch, you can see DocArray mainly as a way of _organizing your data as it flows through your model_.\n\nIt offers you several advantages:\n\n- Express **tensor shapes in type hints**\n- **Group tensors that belong to the same object**, e.g. an audio track and an image\n- **Go directly to deployment**, by re-using your data model as a [FastAPI](https:\u002F\u002Ffastapi.tiangolo.com\u002F) or [Jina](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fjina) API schema\n- Connect model components between **microservices**, using Protobuf and gRPC\n\nDocArray can be used directly inside ML models to handle and represent multimodal data.\nThis allows you to reason about your data using DocArray's abstractions deep inside of `nn.Module`,\nand provides a FastAPI-compatible schema that eases the transition between model training and model serving.\n\nTo see the effect of this, let's first observe a vanilla PyTorch implementation of a tri-modal ML model:\n\n```python\nimport torch\nfrom torch import nn\n\n\ndef encoder(x):  # dummy encoder standing in for a real model\n    return torch.rand(512)\n\n\nclass MyMultiModalModel(nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.audio_encoder = encoder  # assign the dummy function, not a call\n        self.image_encoder = encoder\n        self.text_encoder = encoder\n\n    def forward(self, text_1, text_2, image_1, image_2, audio_1, audio_2):\n        embedding_text_1 = self.text_encoder(text_1)\n        embedding_text_2 = self.text_encoder(text_2)\n\n        embedding_image_1 = self.image_encoder(image_1)\n        embedding_image_2 = self.image_encoder(image_2)\n\n        embedding_audio_1 = self.audio_encoder(audio_1)\n        embedding_audio_2 = self.audio_encoder(audio_2)\n\n        return (\n            embedding_text_1,\n            embedding_text_2,\n            
embedding_image_1,\n            embedding_image_2,\n            embedding_audio_1,\n            embedding_audio_2,\n        )\n```\n\nNot very easy on the eyes if you ask us. And even worse, if you need to add one more modality you have to touch every part of your code base, changing the `forward()` return type and making a whole lot of changes downstream from that.\n\nSo, now let's see what the same code looks like with DocArray:\n\n```python\nfrom docarray import DocList, BaseDoc\nfrom docarray.documents import ImageDoc, TextDoc, AudioDoc\nfrom docarray.typing import TorchTensor\nfrom torch import nn\nimport torch\n\n\ndef encoder(x):  # dummy encoder standing in for a real model\n    return torch.rand(512)\n\n\nclass Podcast(BaseDoc):\n    text: TextDoc\n    image: ImageDoc\n    audio: AudioDoc\n\n\nclass PairPodcast(BaseDoc):\n    left: Podcast\n    right: Podcast\n\n\nclass MyPodcastModel(nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.audio_encoder = encoder  # assign the dummy function, not a call\n        self.image_encoder = encoder\n        self.text_encoder = encoder\n\n    def forward_podcast(self, docs: DocList[Podcast]) -> DocList[Podcast]:\n        docs.audio.embedding = self.audio_encoder(docs.audio.tensor)\n        docs.text.embedding = self.text_encoder(docs.text.tensor)\n        docs.image.embedding = self.image_encoder(docs.image.tensor)\n\n        return docs\n\n    def forward(self, docs: DocList[PairPodcast]) -> DocList[PairPodcast]:\n        docs.left = self.forward_podcast(docs.left)\n        docs.right = self.forward_podcast(docs.right)\n\n        return docs\n```\n\nLooks much better, doesn't it?\nYou instantly win in code readability and maintainability. And for the same price you can turn your PyTorch model into a FastAPI app and reuse your Document\nschema definition (see [below](#coming-from-fastapi)). 
Everything is handled in a pythonic manner by relying on type hints.\n\n\u003C\u002Fdetails>\n\n\n### Coming from TensorFlow\n\n\u003Cdetails markdown=\"1\">\n  \u003Csummary>Click to expand\u003C\u002Fsummary>\n\nLike the [PyTorch approach](#coming-from-pytorch), you can also use DocArray with TensorFlow to handle and represent multimodal data inside your ML model.\n\nTo use DocArray with TensorFlow, first install the required dependencies as follows:\n\n```\npip install tensorflow==2.12.0\npip install protobuf==3.19.0\n```\n\nCompared to using DocArray with PyTorch, there is one main difference when using it with TensorFlow:\nWhile DocArray's `TorchTensor` is a subclass of `torch.Tensor`, this is not the case for the `TensorFlowTensor`: Due to some technical limitations of `tf.Tensor`, DocArray's `TensorFlowTensor` is not a subclass of `tf.Tensor` but rather stores a `tf.Tensor` in its `.tensor` attribute. \n\nHow does this affect you? Whenever you want to access the tensor data to, let's say, do operations with it or hand it to your ML model, instead of handing over your `TensorFlowTensor` instance, you need to access its `.tensor` attribute.\n\nThis would look like the following:\n\n```python\nfrom typing import Optional\n\nfrom docarray import DocList, BaseDoc\nfrom docarray.typing import AudioTensorFlowTensor\n\nimport tensorflow as tf\n\n\nclass Podcast(BaseDoc):\n    audio_tensor: Optional[AudioTensorFlowTensor] = None\n    embedding: Optional[AudioTensorFlowTensor] = None\n\n\nclass AudioEncoder(tf.keras.Model):  # dummy encoder standing in for a real model\n    def call(self, x):\n        return tf.random.uniform((512,))\n\n\nclass MyPodcastModel(tf.keras.Model):\n    def __init__(self):\n        super().__init__()\n        self.audio_encoder = AudioEncoder()\n\n    def call(self, inputs: DocList[Podcast]) -> DocList[Podcast]:\n        inputs.embedding = self.audio_encoder(\n            inputs.audio_tensor.tensor\n        )  # access audio_tensor's .tensor attribute\n        return inputs\n```\n\n\u003C\u002Fdetails>\n\n### Coming from FastAPI\n\n\u003Cdetails markdown=\"1\">\n  \u003Csummary>Click to expand\u003C\u002Fsummary>\n\nDocuments are 
Pydantic Models (with a twist), and as such they are fully compatible with FastAPI!\n\nBut why should you use them, and not the Pydantic models you already know and love?\nGood question!\n\n- Because of the ML-first features, types and validations, [described here](#coming-from-pydantic)\n- Because DocArray can act as an [ORM for vector databases](#coming-from-a-vector-database), similar to what SQLModel does for SQL databases\n\nAnd to seal the deal, let us show you how easily documents slot into your FastAPI app:\n\n```python\nimport numpy as np\nfrom fastapi import FastAPI\nfrom httpx import AsyncClient\nfrom docarray.base_doc import DocArrayResponse\nfrom docarray import BaseDoc\nfrom docarray.documents import ImageDoc\nfrom docarray.typing import NdArray, ImageTensor\n\n\nclass InputDoc(BaseDoc):\n    img: ImageDoc\n    text: str\n\n\nclass OutputDoc(BaseDoc):\n    embedding_clip: NdArray\n    embedding_bert: NdArray\n\n\napp = FastAPI()\n\n\ndef model_img(img: ImageTensor) -> NdArray:\n    return np.zeros((100, 1))\n\n\ndef model_text(text: str) -> NdArray:\n    return np.zeros((100, 1))\n\n\n@app.post(\"\u002Fembed\u002F\", response_model=OutputDoc, response_class=DocArrayResponse)\nasync def create_item(doc: InputDoc) -> OutputDoc:\n    doc = OutputDoc(\n        embedding_clip=model_img(doc.img.tensor), embedding_bert=model_text(doc.text)\n    )\n    return doc\n\n\ninput_doc = InputDoc(text='', img=ImageDoc(tensor=np.random.random((3, 224, 224))))\n\n# inside an async context, e.g. a test:\nasync with AsyncClient(app=app, base_url=\"http:\u002F\u002Ftest\") as ac:\n    response = await ac.post(\"\u002Fembed\u002F\", data=input_doc.json())\n```\n\nJust like a vanilla Pydantic model!\n\n\u003C\u002Fdetails>\n\n### Coming from Jina\n\n\u003Cdetails markdown=\"1\">\n  \u003Csummary>Click to expand\u003C\u002Fsummary>\n\nJina has adopted DocArray as its library for representing and serializing Documents.\n\nJina lets you serve and scale models and services built with DocArray,\nmaking full 
use of DocArray's serialization capabilities.\n\n```python\nimport numpy as np\nfrom jina import Deployment, Executor, requests\nfrom docarray import BaseDoc, DocList\nfrom docarray.documents import ImageDoc\nfrom docarray.typing import NdArray, ImageTensor\n\n\nclass InputDoc(BaseDoc):\n    img: ImageDoc\n    text: str\n\n\nclass OutputDoc(BaseDoc):\n    embedding_clip: NdArray\n    embedding_bert: NdArray\n\n\ndef model_img(img: ImageTensor) -> NdArray:\n    return np.zeros((100, 1))\n\n\ndef model_text(text: str) -> NdArray:\n    return np.zeros((100, 1))\n\n\nclass MyEmbeddingExecutor(Executor):\n    @requests(on='\u002Fembed')\n    def encode(self, docs: DocList[InputDoc], **kwargs) -> DocList[OutputDoc]:\n        ret = DocList[OutputDoc]()\n        for doc in docs:\n            output = OutputDoc(\n                embedding_clip=model_img(doc.img.tensor),\n                embedding_bert=model_text(doc.text),\n            )\n            ret.append(output)\n        return ret\n\n\nwith Deployment(\n    protocols=['grpc', 'http'], ports=[12345, 12346], uses=MyEmbeddingExecutor\n) as dep:\n    resp = dep.post(\n        on='\u002Fembed',\n        inputs=DocList[InputDoc](\n            [InputDoc(text='', img=ImageDoc(tensor=np.random.random((3, 224, 224))))]\n        ),\n        return_type=DocList[OutputDoc],\n    )\n    print(resp)\n```\n\n\u003C\u002Fdetails>\n\n### Coming from a vector database\n\n\u003Cdetails markdown=\"1\">\n  \u003Csummary>Click to expand\u003C\u002Fsummary>\n\nIf you came across DocArray as a universal vector database client, you can best think of it as **a new kind of ORM for vector databases**.\nDocArray's job is to take multimodal, nested and domain-specific data and to map it to a vector database,\nstore it there, and thus make it searchable:\n\n```python\nfrom docarray import DocList, BaseDoc\nfrom docarray.index import HnswDocumentIndex\nimport numpy as np\n\nfrom docarray.typing import ImageUrl, ImageTensor, NdArray\n\n\nclass 
ImageDoc(BaseDoc):\n    url: ImageUrl\n    tensor: ImageTensor\n    embedding: NdArray[128]\n\n\n# create some data\ndl = DocList[ImageDoc](\n    [\n        ImageDoc(\n            url=\"https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002F2\u002F2f\u002FAlpamayo.jpg\",\n            tensor=np.zeros((3, 224, 224)),\n            embedding=np.random.random((128,)),\n        )\n        for _ in range(100)\n    ]\n)\n\n# create a Document Index\nindex = HnswDocumentIndex[ImageDoc](work_dir='\u002Ftmp\u002Ftest_index2')\n\n\n# index your data\nindex.index(dl)\n\n# find similar Documents\nquery = dl[0]\nresults, scores = index.find(query, limit=10, search_field='embedding')\n```\n\nCurrently, DocArray supports the following vector databases:\n\n- [Weaviate](https:\u002F\u002Fwww.weaviate.io\u002F)\n- [Qdrant](https:\u002F\u002Fqdrant.tech\u002F)\n- [Elasticsearch](https:\u002F\u002Fwww.elastic.co\u002Felasticsearch\u002F) v8 and v7\n- [Redis](https:\u002F\u002Fredis.io\u002F)\n- [Milvus](https:\u002F\u002Fmilvus.io)\n- ExactNNMemorySearch as a local alternative with exact kNN search.\n- [HNSWlib](https:\u002F\u002Fgithub.com\u002Fnmslib\u002Fhnswlib) as a local-first ANN alternative\n- [Mongo Atlas](https:\u002F\u002Fwww.mongodb.com\u002F)\n\nAn integration of [OpenSearch](https:\u002F\u002Fopensearch.org\u002F) is currently in progress.\n\nOf course this is only one of the things that DocArray can do, so we encourage you to check out the rest of this readme!\n\n\u003C\u002Fdetails>\n\n\n### Coming from Langchain\n\n\u003Cdetails markdown=\"1\">\n  \u003Csummary>Click to expand\u003C\u002Fsummary>\n\nWith DocArray, you can connect external data to LLMs through Langchain. 
DocArray gives you the freedom to establish \nflexible document schemas and choose from different backends for document storage.\nAfter creating your document index, you can connect it to your Langchain app using [DocArrayRetriever](https:\u002F\u002Fpython.langchain.com\u002Fdocs\u002Fmodules\u002Fdata_connection\u002Fretrievers\u002Fintegrations\u002Fdocarray_retriever).\n\nInstall Langchain via:\n```shell\npip install langchain\n```\n\n1. Define a schema and create documents:\n```python\nfrom docarray import BaseDoc, DocList\nfrom docarray.typing import NdArray\nfrom langchain.embeddings.openai import OpenAIEmbeddings\n\nembeddings = OpenAIEmbeddings()\n\n\n# Define a document schema\nclass MovieDoc(BaseDoc):\n    title: str\n    description: str\n    year: int\n    embedding: NdArray[1536]\n\n\nmovies = [\n    {\"title\": \"#1 title\", \"description\": \"#1 description\", \"year\": 1999},\n    {\"title\": \"#2 title\", \"description\": \"#2 description\", \"year\": 2001},\n]\n\n# Embed `description` and create documents\ndocs = DocList[MovieDoc](\n    MovieDoc(embedding=embeddings.embed_query(movie[\"description\"]), **movie)\n    for movie in movies\n)\n```\n\n2. Initialize a document index using any supported backend:\n```python\nfrom docarray.index import (\n    InMemoryExactNNIndex,\n    HnswDocumentIndex,\n    WeaviateDocumentIndex,\n    QdrantDocumentIndex,\n    ElasticDocIndex,\n    RedisDocumentIndex,\n    MongoDBAtlasDocumentIndex,\n)\n\n# Select a suitable backend and initialize it with data\ndb = InMemoryExactNNIndex[MovieDoc](docs)\n```\n\n3. 
Finally, initialize a retriever and integrate it into your chain!\n```python\nfrom langchain.chat_models import ChatOpenAI\nfrom langchain.chains import ConversationalRetrievalChain\nfrom langchain.retrievers import DocArrayRetriever\n\n\n# Create a retriever\nretriever = DocArrayRetriever(\n    index=db,\n    embeddings=embeddings,\n    search_field=\"embedding\",\n    content_field=\"description\",\n)\n\n# Use the retriever in your chain\nmodel = ChatOpenAI()\nqa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)\n```\n\nAlternatively, you can use built-in vector stores. Langchain supports two vector stores: [DocArrayInMemorySearch](https:\u002F\u002Fpython.langchain.com\u002Fdocs\u002Fmodules\u002Fdata_connection\u002Fvectorstores\u002Fintegrations\u002Fdocarray_in_memory) and [DocArrayHnswSearch](https:\u002F\u002Fpython.langchain.com\u002Fdocs\u002Fmodules\u002Fdata_connection\u002Fvectorstores\u002Fintegrations\u002Fdocarray_hnsw). \nBoth are user-friendly and are best suited to small to medium-sized datasets.\n\n\u003C\u002Fdetails>\n\n\n## See also\n\n- [Documentation](https:\u002F\u002Fdocs.docarray.org)\n- [DocArray\u003C=0.21 documentation](https:\u002F\u002Fdocarray.jina.ai\u002F)\n- [Join our Discord server](https:\u002F\u002Fdiscord.gg\u002FWaMp6PVPgR)\n- [Donation to Linux Foundation AI&Data blog post](https:\u002F\u002Fjina.ai\u002Fnews\u002Fdonate-docarray-lf-for-inclusive-standard-multimodal-data-model\u002F)\n- [Roadmap](https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fissues\u002F1714)\n\n> DocArray is a trademark of LF AI Projects, LLC\n> \n","\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fblob\u002Fmain\u002Fdocs\u002Fassets\u002Flogo-dark.svg?raw=true\" alt=\"DocArray logo: The data structure for unstructured data\" width=\"150px\">\n\u003Cbr>\n\u003Cb>多模态数据的数据结构\u003C\u002Fb>\n\u003C\u002Fp>\n\n\u003Cp align=center>\n\u003Ca 
href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fdocarray\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fdocarray?style=flat-square&amp;label=Release\" alt=\"PyPI\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fbestpractices.coreinfrastructure.org\u002Fprojects\u002F6554\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fdocarray_docarray_readme_43b29938ccd7.png\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fcodecov.io\u002Fgh\u002Fdocarray\u002Fdocarray\">\u003Cimg alt=\"Codecov branch\" src=\"https:\u002F\u002Fimg.shields.io\u002Fcodecov\u002Fc\u002Fgithub\u002Fdocarray\u002Fdocarray\u002Fmain?&logo=Codecov&logoColor=white&style=flat-square\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fpypistats.org\u002Fpackages\u002Fdocarray\">\u003Cimg alt=\"PyPI - Downloads from official pypistats\" src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fdm\u002Fdocarray?style=flat-square\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FWaMp6PVPgR\">\u003Cimg src=\"https:\u002F\u002Fdcbadge.vercel.app\u002Fapi\u002Fserver\u002FWaMp6PVPgR?theme=default-inverted&style=flat-square\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n> **注意**\n> 您当前查看的 README 是针对 DocArray 0.30 的，它相比 DocArray 0.21 引入了一些重大变化。如果您希望继续使用旧版本的 DocArray ≤0.21，请确保通过 `pip install docarray==0.21` 进行安装。更多信息请参考其 [代码库](https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Ftree\u002Fv0.21.0)、[文档](https:\u002F\u002Fdocarray.jina.ai)以及 [修复分支](https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Ftree\u002Fdocarray-v1-fixes)。\n\n\nDocArray 是一个 Python 库，专为多模态数据的 [表示](#represent)、[传输](#send)、[存储](#store) 和 [检索](#retrieve) 而精心设计。它专为多模态 AI 应用程序的开发而打造，其设计确保与广泛的 Python 和机器学习生态系统无缝集成。截至 2022 年 1 月，DocArray 以 [Apache License 2.0](https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fblob\u002Fmain\u002FLICENSE.md) 开源发布，目前是 [LF AI & Data Foundation](https:\u002F\u002Flfaidata.foundation\u002F) 中的一个沙盒项目。\n\n\n\n- :fire: 原生支持 
**[NumPy](https:\u002F\u002Fgithub.com\u002Fnumpy\u002Fnumpy)**、**[PyTorch](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch)**、**[TensorFlow](https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftensorflow)** 和 **[JAX](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fjax)**，特别适用于 **模型训练场景**。\n- :zap: 基于 **[Pydantic](https:\u002F\u002Fgithub.com\u002Fpydantic\u002Fpydantic)**，可立即与 Web 和微服务框架（如 **[FastAPI](https:\u002F\u002Fgithub.com\u002Ftiangolo\u002Ffastapi\u002F)** 和 **[Jina](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fjina\u002F)**）兼容。\n- :package: 支持多种向量数据库，包括 **[Weaviate](https:\u002F\u002Fweaviate.io\u002F)**、**[Qdrant](https:\u002F\u002Fqdrant.tech\u002F)**、**[ElasticSearch](https:\u002F\u002Fwww.elastic.co\u002Fde\u002Felasticsearch\u002F)**、**[Redis](https:\u002F\u002Fredis.io\u002F)**、**[Mongo Atlas](https:\u002F\u002Fwww.mongodb.com\u002F)** 和 **[HNSWLib](https:\u002F\u002Fgithub.com\u002Fnmslib\u002Fhnswlib)**。\n- :chains: 支持通过 **HTTP** 以 JSON 格式传输数据，或通过 **[gRPC](https:\u002F\u002Fgrpc.io\u002F)** 使用 **[Protobuf](https:\u002F\u002Fprotobuf.dev\u002F)** 进行传输。\n\n## 安装\n\n要从命令行安装 DocArray，请运行以下命令：\n\n```shell\npip install -U docarray\n```\n\n> **注意**\n> 如果您需要使用 DocArray ≤0.21，请确保通过 `pip install docarray==0.21` 进行安装，并查看其 [代码库](https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Ftree\u002Fv0.21.0)、[文档](https:\u002F\u002Fdocarray.jina.ai)以及 [修复分支](https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Ftree\u002Fdocarray-v1-fixes)。\n\n## 入门\n刚接触 DocArray？根据您的使用场景和背景，有多种方式可以了解 DocArray：\n \n- [来自纯 PyTorch 或 TensorFlow](#coming-from-pytorch)\n- [来自 Pydantic](#coming-from-pydantic)\n- [来自 FastAPI](#coming-from-fastapi)\n- [来自 Jina](#coming-from-jina)\n- [来自向量数据库](#coming-from-a-vector-database)\n- [来自 Langchain](#coming-from-langchain)\n\n\n## 表示\n\nDocArray 让您可以以一种天然契合机器学习的方式 **表示您的数据**。\n\n这在以下各种场景中尤为有用：\n\n- :running: 您正在 **训练模型**：您处理的是形状和大小各异的张量，每个张量代表不同的元素。您希望有一种方法来逻辑地组织它们。\n- :cloud: 您正在 **部署模型**：例如通过 FastAPI，您希望精确地定义 API 端点。\n- 
:card_index_dividers: 您正在 **解析数据**：也许是为了将来在您的机器学习或数据科学项目中使用。\n\n> :bulb: **熟悉 Pydantic 吗？** 您会很高兴地知道，\n> DocArray 不仅构建在 Pydantic 之上，而且与其完全兼容！\n> 此外，我们还有一个专门针对您的需求的 [部分](#coming-from-pydantic)！\n\n本质上，DocArray 以类似于 Python 数据类的方式进行数据表示，同时将机器学习作为其核心组成部分：\n\n\n```python\nfrom docarray import BaseDoc\nfrom docarray.typing import TorchTensor, ImageUrl\nimport torch\n\n\n# 定义您的数据模型\nclass MyDocument(BaseDoc):\n    description: str\n    image_url: ImageUrl  # 也可以是 VideoUrl、AudioUrl 等\n    image_tensor: TorchTensor[1704, 2272, 3]  # 您可以指定张量的形状！\n\n\n# 将多个文档堆叠成一个文档向量\nfrom docarray import DocVec\n\nvec = DocVec[MyDocument](\n    [\n        MyDocument(\n            description=\"一只猫\",\n            image_url=\"https:\u002F\u002Fexample.com\u002Fcat.jpg\",\n            image_tensor=torch.rand(1704, 2272, 3),\n        ),\n    ]\n    * 10\n)\nprint(vec.image_tensor.shape)  # (10, 1704, 2272, 3)\n```\n\n\u003Cdetails markdown=\"1\">\n  \u003Csummary>点击查看更多详情\u003C\u002Fsummary>\n\n让我们更详细地看看如何使用 DocArray 表示您的数据：\n\n```python\nfrom docarray import BaseDoc\nfrom docarray.typing import TorchTensor, ImageUrl\nfrom typing import Optional\nimport torch\n\n\n# 定义您的数据模型\nclass MyDocument(BaseDoc):\n    description: str\n    image_url: ImageUrl  # 也可以是 VideoUrl、AudioUrl 等\n    image_tensor: Optional[\n        TorchTensor[1704, 2272, 3]\n    ] = None  # 也可以是 NdArray 或 TensorflowTensor\n    embedding: Optional[TorchTensor] = None\n```\n\n因此，您不仅可以定义数据的类型，还可以 **指定张量的形状！**\n\n```python\n# 创建一个文档\ndoc = MyDocument(\n    description=\"这是一张山的照片\",\n    image_url=\"https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002F2\u002F2f\u002FAlpamayo.jpg\",\n)\n\n# 从 URL 加载图像张量\ndoc.image_tensor = doc.image_url.load()\n\n\n# 使用您选择的任何模型计算嵌入\ndef clip_image_encoder(image_tensor: TorchTensor) -> TorchTensor:  # 虚拟函数\n    return torch.rand(512)\n\n\ndoc.embedding = clip_image_encoder(doc.image_tensor)\n\nprint(doc.embedding.shape)  # torch.Size([512])\n```\n\n### 组合嵌套的 Document\n\n当然，你可以将 
Document 组合成嵌套结构：\n\n```python\nfrom docarray import BaseDoc\nfrom docarray.documents import ImageDoc, TextDoc\nimport numpy as np\n\n\nclass MultiModalDocument(BaseDoc):\n    image_doc: ImageDoc\n    text_doc: TextDoc\n\n\ndoc = MultiModalDocument(\n    image_doc=ImageDoc(tensor=np.zeros((3, 224, 224))), text_doc=TextDoc(text='hi!')\n)\n```\n\n在实际应用中，尤其是机器学习领域，很少会单独处理一个数据点。因此，你可以轻松地收集多个 `Document`：\n\n### 收集多个 `Document`\n\n在构建或与机器学习系统交互时，通常需要一次处理多个 Document（即多个数据点）。\n\nDocArray 提供了两种数据结构来实现这一点：\n\n- **`DocVec`**：一组 `Document` 的向量。所有 Document 中的张量会被堆叠成一个单一的张量。**非常适合批量处理和在机器学习模型中使用**。\n- **`DocList`**：一组 `Document` 的列表。所有 Document 中的张量保持原样。**非常适合流式传输、重新排序和数据洗牌**。\n\n我们先来看看 `DocVec`：\n\n```python\nfrom docarray import DocVec, BaseDoc\nfrom docarray.typing import AnyTensor, ImageUrl\nimport numpy as np\n\n\nclass Image(BaseDoc):\n    url: ImageUrl\n    tensor: AnyTensor  # 这允许使用 PyTorch、NumPy 和 TensorFlow 张量\n\n\nvec = DocVec[Image](  # DocVec 根据你的自定义 Schema 进行参数化！\n    [\n        Image(\n            url=\"https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002F2\u002F2f\u002FAlpamayo.jpg\",\n            tensor=np.zeros((3, 224, 224)),\n        )\n        for _ in range(100)\n    ]\n)\n``` \n\n在上面的代码片段中，`DocVec` 是根据你想要使用的 Document 类型进行参数化的：`DocVec[Image]`。\n\n这看起来可能有些奇怪，但我们相信你会很快习惯！此外，它还允许我们做一些很酷的事情，比如**批量访问你在 Document 中定义的字段**：\n\n```python\ntensor = vec.tensor  # 获取 DocVec 中的所有张量\nprint(tensor.shape)  # 它们被堆叠成了一个单一的张量！\nprint(vec.url)  # 你也可以批量访问其他字段\n```\n\n第二种数据结构 `DocList` 的工作方式类似：\n\n```python\nfrom docarray import DocList\n\ndl = DocList[Image](  # DocList 根据你的自定义 Schema 进行参数化！\n    [\n        Image(\n            url=\"https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002F2\u002F2f\u002FAlpamayo.jpg\",\n            tensor=np.zeros((3, 224, 224)),\n        )\n        for _ in range(100)\n    ]\n)\n```\n\n你仍然可以批量访问 Document 中的字段：\n\n```python\ntensors = dl.tensor  # 获取 DocList 中的所有张量\nprint(type(tensors))  # 
以张量列表的形式\nprint(dl.url)  # 你也可以批量访问其他字段\n```\n\n此外，你还可以向 `DocList` 中插入、删除或追加 Document：\n\n```python\n# 追加\ndl.append(\n    Image(\n        url=\"https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002F2\u002F2f\u002FAlpamayo.jpg\",\n        tensor=np.zeros((3, 224, 224)),\n    )\n)\n# 删除\ndel dl[0]\n# 插入\ndl.insert(\n    0,\n    Image(\n        url=\"https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002F2\u002F2f\u002FAlpamayo.jpg\",\n        tensor=np.zeros((3, 224, 224)),\n    ),\n)\n```\n\n你还可以无缝地在 `DocVec` 和 `DocList` 之间切换：\n\n```python\nvec_2 = dl.to_doc_vec()\nassert isinstance(vec_2, DocVec)\n\ndl_2 = vec_2.to_doc_list()\nassert isinstance(dl_2, DocList)\n```\n\n\u003C\u002Fdetails>\n\n## 发送\n\nDocArray 能够以与机器学习天然兼容的方式促进**数据的传输**。\n\n这包括对 **Protobuf 和 gRPC** 的原生支持，以及对 **HTTP**、JSON、JSONSchema、Base64 和字节序列化的支持。\n\n这一特性在以下几种场景中非常有用：\n\n- :cloud: 你正在**部署模型服务**，例如通过 **[Jina](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fjina\u002F)** 或 **[FastAPI](https:\u002F\u002Fgithub.com\u002Ftiangolo\u002Ffastapi\u002F)** 等框架。\n- :spider_web: 你正在**将模型分布到多台机器上**，需要一种高效的数据传输方式。\n- :gear: 你正在构建**微服务架构**，需要一种在微服务之间传输数据的方法。\n\n> :bulb: **你熟悉 FastAPI 吗？** 你会很高兴地知道，DocArray 与 FastAPI 完全兼容！\n> 此外，我们还有一个专门为你准备的 [部分](#coming-from-fastapi)！\n\n在数据传输过程中，序列化是一个关键步骤。让我们深入了解一下 DocArray 如何简化这一过程：\n\n\n```python\nfrom docarray import BaseDoc\nfrom docarray.typing import ImageTorchTensor\nimport torch\n\n\n# 建模你的数据\nclass MyDocument(BaseDoc):\n    description: str\n    image: ImageTorchTensor[3, 224, 224]\n\n\n# 创建一个 Document\ndoc = MyDocument(\n    description=\"这是一个描述\",\n    image=torch.zeros((3, 224, 224)),\n)\n\n# 序列化它！\nproto = doc.to_protobuf()\nbytes_ = doc.to_bytes()\njson = doc.json()\n\n# 反序列化它！\ndoc_2 = MyDocument.from_protobuf(proto)\ndoc_4 = MyDocument.from_bytes(bytes_)\ndoc_5 = MyDocument.parse_raw(json)\n```\n\n当然，仅仅序列化是不够的。接下来，让我们看看 DocArray 如何与 **[Jina](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fjina\u002F)** 和 
**[FastAPI](https:\u002F\u002Fgithub.com\u002Ftiangolo\u002Ffastapi\u002F)** 集成。\n\n## 存储\n\n在对数据进行建模并可能将其分发之后，你通常会希望将其**存储起来**。这时，DocArray 就派上用场了！\n\n**Document Store** 提供了一种无缝的方式来存储你的 Document，正如其名称所示。无论是在本地还是远程，你都可以通过相同的用户界面完成操作：\n\n- :cd: **本地磁盘**，作为文件保存在你的本地文件系统中。\n- :bucket: 在 **[AWS S3](https:\u002F\u002Faws.amazon.com\u002Fde\u002Fs3\u002F)** 上。\n- :cloud: 在 **[Jina AI Cloud](https:\u002F\u002Fcloud.jina.ai\u002F)** 上。\n\nDocument Store 的界面允许你从多个数据源推送和拉取 Document，而这一切都通过同一个用户界面完成。\n\n例如，让我们看看如何使用本地磁盘存储：\n\n```python\nfrom docarray import BaseDoc, DocList\n\n\nclass SimpleDoc(BaseDoc):\n    text: str\n\n\ndocs = DocList[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(8)])\ndocs.push('file:\u002F\u002Fsimple_docs')\n\ndocs_pull = DocList[SimpleDoc].pull('file:\u002F\u002Fsimple_docs')\n```\n\n## 检索\n\n**文档索引** 允许您将文档索引到 **向量数据库** 中，以便高效地进行基于相似度的检索。\n\n这在以下场景中非常有用：\n\n- :left_speech_bubble: 使用领域知识增强 **LLM 和聊天机器人**（[检索增强生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.11401)）\n- :mag: **神经搜索** 应用\n- :bulb: **推荐系统**\n\n目前，文档索引支持 **[Weaviate](https:\u002F\u002Fweaviate.io\u002F)**、**[Qdrant](https:\u002F\u002Fqdrant.tech\u002F)**、**[ElasticSearch](https:\u002F\u002Fwww.elastic.co\u002F)**、**[Redis](https:\u002F\u002Fredis.io\u002F)**、**[Mongo Atlas](https:\u002F\u002Fwww.mongodb.com\u002F)** 以及 **[HNSWLib](https:\u002F\u002Fgithub.com\u002Fnmslib\u002Fhnswlib)**，未来还将支持更多！\n\n文档索引接口允许您从多个向量数据库中索引和检索文档，且所有操作都使用相同的用户界面。\n它支持近似最近邻向量搜索、文本搜索、过滤以及混合搜索。\n\n```python\nfrom docarray import DocList, BaseDoc\nfrom docarray.index import HnswDocumentIndex\nimport numpy as np\n\nfrom docarray.typing import ImageUrl, ImageTensor, NdArray\n\n\nclass ImageDoc(BaseDoc):\n    url: ImageUrl\n    tensor: ImageTensor\n    embedding: NdArray[128]\n\n\n# 创建一些数据\ndl = DocList[ImageDoc](\n    [\n        ImageDoc(\n            url=\"https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002F2\u002F2f\u002FAlpamayo.jpg\",\n            tensor=np.zeros((3, 224, 224)),\n    
        embedding=np.random.random((128,)),\n        )\n        for _ in range(100)\n    ]\n)\n\n# 创建一个文档索引\nindex = HnswDocumentIndex[ImageDoc](work_dir='\u002Ftmp\u002Ftest_index')\n\n\n# 索引您的数据\nindex.index(dl)\n\n# 查找相似文档\nquery = dl[0]\nresults, scores = index.find(query, limit=10, search_field='embedding')\n```\n\n---\n\n## 学习 DocArray\n\n根据您的背景和使用场景，您可以采用不同的方式来理解 DocArray。\n\n### 来自 DocArray ≤0.21 的用户\n\n\u003Cdetails markdown=\"1\">\n  \u003Csummary>点击展开\u003C\u002Fsummary>\n\n如果您正在使用 DocArray 0.21 或更低版本，您应该熟悉其 [数据类 API](https:\u002F\u002Fdocarray.jina.ai\u002Ffundamentals\u002Fdataclass\u002F)。\n\n_DocArray ≥0.30 是这一理念的进一步发展。_ 每个文档都是通过类似数据类的接口创建的，\n这得益于 [Pydantic](https:\u002F\u002Fpydantic-docs.helpmanual.io\u002Fusage\u002Fmodels\u002F)。\n\n这带来了以下优势：\n- **灵活性：** 无需遵循固定的字段集——您的数据定义了模式。\n- **多模态性：** 文档本质上只是字典。这使得它们可以轻松地从任何语言创建并发送，而不仅仅是 Python。\n\n您可能也熟悉我们用于向量数据库集成的旧版文档存储。\n现在它们被称为 **文档索引**，并提供了以下改进（有关新 API，请参阅 [此处](#store)）：\n\n- **混合搜索：** 您现在可以将向量搜索与文本搜索结合，甚至可以根据任意字段进行过滤。\n- **生产就绪：** 新的文档索引是对各种向量数据库库的更轻量封装，使其更加健壮且易于维护。\n- **更高的灵活性：** 我们致力于支持您可以通过数据库官方客户端执行的任何配置或设置。\n\n目前，文档索引支持 **[Weaviate](https:\u002F\u002Fweaviate.io\u002F)**、**[Qdrant](https:\u002F\u002Fqdrant.tech\u002F)**、**[ElasticSearch](https:\u002F\u002Fwww.elastic.co\u002F)**、**[Redis](https:\u002F\u002Fredis.io\u002F)**、**[Mongo Atlas](https:\u002F\u002Fwww.mongodb.com\u002F)**、精确最近邻搜索以及 **[HNSWLib](https:\u002F\u002Fgithub.com\u002Fnmslib\u002Fhnswlib)**，未来还将支持更多。\n\n\u003C\u002Fdetails>\n\n### 来自 Pydantic 的用户\n\n\u003Cdetails markdown=\"1\">\n  \u003Csummary>点击展开\u003C\u002Fsummary>\n\n如果您来自 Pydantic 社区，可以将 DocArray 文档视为功能增强的 Pydantic 模型，而 DocArray 则是围绕这些模型的一系列扩展工具。\n\n具体来说，我们的目标是 **让 Pydantic 更适合机器学习领域**——不是取代它，而是在此基础上构建！\n\n这意味着您将获得以下好处：\n\n- **面向机器学习的类型：** Tensor、TorchTensor、Embedding 等，包括 **张量形状验证**。\n- 与 **FastAPI** 完全兼容。\n- **DocList** 和 **DocVec** 将模型的概念推广到模型的 _序列_ 或 _批次_。非常适合 **用于机器学习模型** 和其他批量处理任务。\n- **活体类型：** ImageUrl 可以 `.load()` URL 并转换为图像张量，TextUrl 可以加载并分词文本文档等。\n- 
云就绪：支持序列化为 **Protobuf**，适用于微服务和 **gRPC**。\n- **预建的多模态文档**，适用于不同数据模态：图像、文本、3D网格、视频、音频等。请注意，所有这些都符合 Pydantic 模型的标准！\n- **文档存储** 和 **文档索引** 让您能够存储数据，并通过 **向量搜索** 进行检索。\n\n这里最明显的优势是 **对以机器学习为中心的数据提供一流的支持**，例如 `{Torch, TF, ...}Tensor`、`Embedding` 等。\n\n这还包括一些实用的功能，比如验证张量的形状：\n\n```python\nfrom docarray import BaseDoc\nfrom docarray.typing import TorchTensor\nimport torch\n\n\nclass MyDoc(BaseDoc):\n    tensor: TorchTensor[3, 224, 224]\n\n\ndoc = MyDoc(tensor=torch.zeros(3, 224, 224))  # 成功\ndoc = MyDoc(tensor=torch.zeros(224, 224, 3))  # 通过重塑成功\n\ntry:\n    doc = MyDoc(tensor=torch.zeros(224))  # 验证失败\nexcept Exception as e:\n    print(e)\n    # tensor\n    # 无法将形状为 (224,) 的张量重塑为 (3, 224, 224)（类型=value_error）\n\n\nclass Image(BaseDoc):\n    tensor: TorchTensor[3, 'x', 'x']\n\n\nImage(tensor=torch.zeros(3, 224, 224))  # 成功\n\ntry:\n    Image(\n        tensor=torch.zeros(3, 64, 128)\n    )  # 验证失败，因为第二维度与第三维度不匹配\nexcept Exception as e:\n    print(e)\n\n\ntry:\n    Image(\n        tensor=torch.zeros(4, 224, 224)\n    )  # 验证失败，因为第一维度不符合要求\nexcept Exception as e:\n    print(e)\n    # 张量形状不匹配。预期 (3, 'x', 'x')，实际得到 (4, 224, 224)（类型=value_error）\n\ntry:\n    Image(\n        tensor=torch.zeros(3, 64)\n    )  # 验证失败，因为维度不足\nexcept Exception as e:\n    print(e)\n    # 张量形状不匹配。预期 (3, 'x', 'x')，实际得到 (3, 64)（类型=value_error）\n```\n\n\u003C\u002Fdetails>\n\n### 来自 PyTorch\n\n\u003Cdetails markdown=\"1\">\n  \u003Csummary>点击展开\u003C\u002Fsummary>\n\n如果你来自 PyTorch 生态，可以将 DocArray 看作一种 _在数据流经模型时组织数据_ 的方式。\n\n它为你提供了多项优势：\n\n- 在类型提示中表达 **张量的形状**\n- **将属于同一对象的张量分组**，例如一段音频和一张图像\n- **直接部署上线**，通过复用你的数据模型作为 [FastAPI](https:\u002F\u002Ffastapi.tiangolo.com\u002F) 或 [Jina](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fjina) API 的 Schema\n- 使用 Protobuf 和 gRPC 在 **微服务之间** 连接模型组件\n\nDocArray 可以直接用于机器学习模型中，以处理和表示多模态数据。这使得你能够在 `nn.Module` 的深层逻辑中利用 DocArray 的抽象来分析数据，并提供与 FastAPI 兼容的 Schema，从而简化从模型训练到模型推理部署的过渡。\n\n为了更好地理解这一点，我们先来看一个原生 PyTorch 实现的三模态机器学习模型：\n\n```python\nimport torch\nfrom torch import 
nn\n\n\ndef encoder(x):\n    return torch.rand(512)\n\n\nclass MyMultiModalModel(nn.Module):\n    def __init__(self):\n        super().__init__()\n        # 示例中三个模态共用同一个虚拟编码函数\n        self.audio_encoder = encoder\n        self.image_encoder = encoder\n        self.text_encoder = encoder\n\n    def forward(self, text_1, text_2, image_1, image_2, audio_1, audio_2):\n        embedding_text_1 = self.text_encoder(text_1)\n        embedding_text_2 = self.text_encoder(text_2)\n\n        embedding_image_1 = self.image_encoder(image_1)\n        embedding_image_2 = self.image_encoder(image_2)\n\n        embedding_audio_1 = self.audio_encoder(audio_1)\n        embedding_audio_2 = self.audio_encoder(audio_2)\n\n        return (\n            embedding_text_1,\n            embedding_text_2,\n            embedding_image_1,\n            embedding_image_2,\n            embedding_audio_1,\n            embedding_audio_2,\n        )\n```\n\n坦白说，这段代码可读性并不高。更糟糕的是，如果需要再增加一种模态，你就不得不修改整个代码库：既要更改 `forward()` 方法的返回值类型，还要对后续所有依赖这部分输出的代码进行大量调整。\n\n接下来，让我们看看使用 DocArray 后同样的代码会是什么样子：\n\n```python\nfrom docarray import DocList, BaseDoc\nfrom docarray.documents import ImageDoc, TextDoc, AudioDoc\nfrom docarray.typing import TorchTensor\nfrom torch import nn\nimport torch\n\n\ndef encoder(x):\n    return torch.rand(512)\n\n\nclass Podcast(BaseDoc):\n    text: TextDoc\n    image: ImageDoc\n    audio: AudioDoc\n\n\nclass PairPodcast(BaseDoc):\n    left: Podcast\n    right: Podcast\n\n\nclass MyPodcastModel(nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.audio_encoder = encoder\n        self.image_encoder = encoder\n        self.text_encoder = encoder\n\n    def forward_podcast(self, docs: DocList[Podcast]) -> DocList[Podcast]:\n        docs.audio.embedding = self.audio_encoder(docs.audio.tensor)\n        docs.text.embedding = self.text_encoder(docs.text.tensor)\n        docs.image.embedding = self.image_encoder(docs.image.tensor)\n\n        return docs\n\n    def forward(self, docs: 
DocList[PairPodcast]) -> DocList[PairPodcast]:\n        docs.left = self.forward_podcast(docs.left)\n        docs.right = self.forward_podcast(docs.right)\n\n        return docs\n```\n\n是不是清晰多了？代码的可读性和可维护性瞬间提升。而且只需稍加改动，你就能将 PyTorch 模型转换为 FastAPI 应用，并复用你的 Document Schema 定义（详见下方“来自 FastAPI”部分）。这一切都通过 Python 式的类型提示来实现。\n\n\u003C\u002Fdetails>\n\n\n### 来自 TensorFlow\n\n\u003Cdetails markdown=\"1\">\n  \u003Csummary>点击展开\u003C\u002Fsummary>\n\n与 [PyTorch 方案](#coming-from-pytorch) 类似，你也可以在 TensorFlow 中使用 DocArray 来处理和表示多模态数据。\n\n首先，要在 TensorFlow 中使用 DocArray，你需要安装以下依赖：\n\n```\npip install tensorflow==2.12.0\npip install protobuf==3.19.0\n```\n\n与在 PyTorch 中使用 DocArray 相比，在 TensorFlow 中使用时有一个主要区别：DocArray 的 `TorchTensor` 是 `torch.Tensor` 的子类，而 `TensorFlowTensor` 则不是。由于 `tf.Tensor` 的一些技术限制，DocArray 的 `TensorFlowTensor` 并非 `tf.Tensor` 的子类，而是将其存储在 `.tensor` 属性中。\n\n这会对你产生什么影响呢？每当你需要访问张量数据——比如对其进行操作或传递给你的机器学习模型时——就不能直接传递 `TensorFlowTensor` 实例，而必须通过其 `.tensor` 属性来获取张量本身。\n\n示例如下：\n\n```python\nfrom typing import Optional\n\nfrom docarray import DocList, BaseDoc\nfrom docarray.typing import AudioTensorFlowTensor\n\nimport tensorflow as tf\n\n\nclass Podcast(BaseDoc):\n    audio_tensor: Optional[AudioTensorFlowTensor] = None\n    embedding: Optional[AudioTensorFlowTensor] = None\n\n\nclass MyPodcastModel(tf.keras.Model):\n    def __init__(self):\n        super().__init__()\n        self.audio_encoder = AudioEncoder()  # 示意用的音频编码模型\n\n    def call(self, inputs: DocList[Podcast]) -> DocList[Podcast]:\n        inputs.audio_tensor.embedding = self.audio_encoder(\n            inputs.audio_tensor.tensor\n        )  # 访问 audio_tensor 的 .tensor 属性\n        return inputs\n```\n\n\u003C\u002Fdetails>\n\n### 来自 FastAPI\n\n\u003Cdetails markdown=\"1\">\n  \u003Csummary>点击展开\u003C\u002Fsummary>\n\n文档是 Pydantic 模型（略有不同），因此它们与 FastAPI 完全兼容！\n\n但为什么你应该使用它们，而不是你已经熟悉并喜爱的 Pydantic 模型呢？\n这是个好问题！\n\n- 因为它们具备以机器学习为中心的功能、类型和验证，[详见此处](#coming-from-pydantic)\n- 因为 DocArray 可以充当[向量数据库的 ORM](#coming-from-a-vector-database)，类似于 SQLModel 对 SQL 
数据库的作用\n\n为了进一步说明，我们来展示文档如何轻松地融入你的 FastAPI 应用：\n\n```python\nimport numpy as np\nfrom fastapi import FastAPI\nfrom httpx import AsyncClient\nfrom docarray.base_doc import DocArrayResponse\nfrom docarray import BaseDoc\nfrom docarray.documents import ImageDoc\nfrom docarray.typing import NdArray, ImageTensor\n\n\nclass InputDoc(BaseDoc):\n    img: ImageDoc\n    text: str\n\n\nclass OutputDoc(BaseDoc):\n    embedding_clip: NdArray\n    embedding_bert: NdArray\n\n\napp = FastAPI()\n\n\ndef model_img(img: ImageTensor) -> NdArray:\n    return np.zeros((100, 1))\n\n\ndef model_text(text: str) -> NdArray:\n    return np.zeros((100, 1))\n\n\n@app.post(\"\u002Fembed\u002F\", response_model=OutputDoc, response_class=DocArrayResponse)\nasync def create_item(doc: InputDoc) -> OutputDoc:\n    doc = OutputDoc(\n        embedding_clip=model_img(doc.img.tensor), embedding_bert=model_text(doc.text)\n    )\n    return doc\n\n\ninput_doc = InputDoc(text='', img=ImageDoc(tensor=np.random.random((3, 224, 224))))\n\n# 在异步上下文（如测试）中运行：\nasync with AsyncClient(app=app, base_url=\"http:\u002F\u002Ftest\") as ac:\n    response = await ac.post(\"\u002Fembed\u002F\", data=input_doc.json())\n```\n\n就像普通的 Pydantic 模型一样！\n\n\u003C\u002Fdetails>\n\n### 来自 Jina\n\n\u003Cdetails markdown=\"1\">\n  \u003Csummary>点击展开\u003C\u002Fsummary>\n\nJina 已经采用 DocArray 作为其用于表示和序列化文档的库。\n\nJina 允许你部署和扩展使用 DocArray 构建的模型和服务，从而充分利用 DocArray 的序列化能力。\n\n```python\nimport numpy as np\nfrom jina import Deployment, Executor, requests\nfrom docarray import BaseDoc, DocList\nfrom docarray.documents import ImageDoc\nfrom docarray.typing import NdArray, ImageTensor\n\n\nclass InputDoc(BaseDoc):\n    img: ImageDoc\n    text: str\n\n\nclass OutputDoc(BaseDoc):\n    embedding_clip: NdArray\n    embedding_bert: NdArray\n\n\ndef model_img(img: ImageTensor) -> NdArray:\n    return np.zeros((100, 1))\n\n\ndef model_text(text: str) -> NdArray:\n    return np.zeros((100, 1))\n\n\nclass MyEmbeddingExecutor(Executor):\n    @requests(on='\u002Fembed')\n    def encode(self, docs: 
DocList[InputDoc], **kwargs) -> DocList[OutputDoc]:\n        ret = DocList[OutputDoc]()\n        for doc in docs:\n            output = OutputDoc(\n                embedding_clip=model_img(doc.img.tensor),\n                embedding_bert=model_text(doc.text),\n            )\n            ret.append(output)\n        return ret\n\n\nwith Deployment(\n    protocols=['grpc', 'http'], ports=[12345, 12346], uses=MyEmbeddingExecutor\n) as dep:\n    resp = dep.post(\n        on='\u002Fembed',\n        inputs=DocList[InputDoc](\n            [InputDoc(text='', img=ImageDoc(tensor=np.random.random((3, 224, 224))))]\n        ),\n        return_type=DocList[OutputDoc],\n    )\n    print(resp)\n```\n\n\u003C\u002Fdetails>\n\n### 来自向量数据库\n\n\u003Cdetails markdown=\"1\">\n  \u003Csummary>点击展开\u003C\u002Fsummary>\n\n如果你将 DocArray 视为通用的向量数据库客户端，那么你可以把它看作**一种新型的向量数据库 ORM**。\nDocArray 的作用是将多模态、嵌套且领域特定的数据映射到向量数据库中，在那里存储并使其可搜索：\n\n```python\nfrom docarray import DocList, BaseDoc\nfrom docarray.index import HnswDocumentIndex\nimport numpy as np\n\nfrom docarray.typing import ImageUrl, ImageTensor, NdArray\n\n\nclass ImageDoc(BaseDoc):\n    url: ImageUrl\n    tensor: ImageTensor\n    embedding: NdArray[128]\n\n\n# 创建一些数据\ndl = DocList[ImageDoc](\n    [\n        ImageDoc(\n            url=\"https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002F2\u002F2f\u002FAlpamayo.jpg\",\n            tensor=np.zeros((3, 224, 224)),\n            embedding=np.random.random((128,)),\n        )\n        for _ in range(100)\n    ]\n)\n\n# 创建一个文档索引\nindex = HnswDocumentIndex[ImageDoc](work_dir='\u002Ftmp\u002Ftest_index2')\n\n\n# 索引你的数据\nindex.index(dl)\n\n# 查找相似的文档\nquery = dl[0]\nresults, scores = index.find(query, limit=10, search_field='embedding')\n```\n\n目前，DocArray 支持以下向量数据库：\n\n- [Weaviate](https:\u002F\u002Fwww.weaviate.io\u002F)\n- [Qdrant](https:\u002F\u002Fqdrant.tech\u002F)\n- [Elasticsearch](https:\u002F\u002Fwww.elastic.co\u002Felasticsearch\u002F) v8 和 v7\n- 
[Redis](https:\u002F\u002Fredis.io\u002F)\n- [Milvus](https:\u002F\u002Fmilvus.io)\n- ExactNNMemorySearch 作为一种本地替代方案，提供精确的 kNN 搜索。\n- [HNSWlib](https:\u002F\u002Fgithub.com\u002Fnmslib\u002Fhnswlib) 作为一种本地优先的近似最近邻搜索替代方案\n- [Mongo Atlas](https:\u002F\u002Fwww.mongodb.com\u002F)\n\n目前正在开发对 [OpenSearch](https:\u002F\u002Fopensearch.org\u002F) 的集成。\n\n当然，这仅仅是 DocArray 能够做到的事情之一，所以我们鼓励你查看本 README 的其余部分！\n\n\u003C\u002Fdetails>\n\n\n### 来自 Langchain\n\n\u003Cdetails markdown=\"1\">\n  \u003Csummary>点击展开\u003C\u002Fsummary>\n\n借助 DocArray，你可以通过 Langchain 将外部数据连接到 LLM。DocArray 让你能够自由定义灵活的文档模式，并从不同的后端中选择文档存储方式。\n创建文档索引后，你可以使用 [DocArrayRetriever](https:\u002F\u002Fpython.langchain.com\u002Fdocs\u002Fmodules\u002Fdata_connection\u002Fretrievers\u002Fintegrations\u002Fdocarray_retriever) 将其连接到你的 Langchain 应用程序。\n\n通过以下命令安装 Langchain：\n```shell\npip install langchain\n```\n\n1. 定义模式并创建文档：\n```python\nfrom docarray import BaseDoc, DocList\nfrom docarray.typing import NdArray\nfrom langchain.embeddings.openai import OpenAIEmbeddings\n\nembeddings = OpenAIEmbeddings()\n\n\n# 定义文档模式\nclass MovieDoc(BaseDoc):\n    title: str\n    description: str\n    year: int\n    embedding: NdArray[1536]\n\n\nmovies = [\n    {\"title\": \"#1 title\", \"description\": \"#1 description\", \"year\": 1999},\n    {\"title\": \"#2 title\", \"description\": \"#2 description\", \"year\": 2001},\n]\n\n# 嵌入“description”并创建文档\ndocs = DocList[MovieDoc](\n    MovieDoc(embedding=embeddings.embed_query(movie[\"description\"]), **movie)\n    for movie in movies\n)\n```\n\n2. 使用任何支持的后端初始化文档索引：\n```python\nfrom docarray.index import (\n    InMemoryExactNNIndex,\n    HnswDocumentIndex,\n    WeaviateDocumentIndex,\n    QdrantDocumentIndex,\n    ElasticDocIndex,\n    RedisDocumentIndex,\n    MongoDBAtlasDocumentIndex,\n)\n\n# 选择合适的后端并用数据初始化\ndb = InMemoryExactNNIndex[MovieDoc](docs)\n```\n\n3. 
最后，初始化检索器并将其集成到你的链中！\n```python\nfrom langchain.chat_models import ChatOpenAI\nfrom langchain.chains import ConversationalRetrievalChain\nfrom langchain.retrievers import DocArrayRetriever\n\n# 创建一个检索器\nretriever = DocArrayRetriever(\n    index=db,\n    embeddings=embeddings,\n    search_field=\"embedding\",\n    content_field=\"description\",\n)\n\n# 在你的链中使用该检索器\nmodel = ChatOpenAI()\nqa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)\n```\n\n或者，你也可以使用内置的向量存储。Langchain 支持两种向量存储：[DocArrayInMemorySearch](https:\u002F\u002Fpython.langchain.com\u002Fdocs\u002Fmodules\u002Fdata_connection\u002Fvectorstores\u002Fintegrations\u002Fdocarray_in_memory) 和 [DocArrayHnswSearch](https:\u002F\u002Fpython.langchain.com\u002Fdocs\u002Fmodules\u002Fdata_connection\u002Fvectorstores\u002Fintegrations\u002Fdocarray_hnsw)。\n这两种存储都易于使用，最适合中小型数据集。\n\n\u003C\u002Fdetails>\n\n\n## 参阅\n\n- [文档](https:\u002F\u002Fdocs.docarray.org)\n- [DocArray≤0.21 文档](https:\u002F\u002Fdocarray.jina.ai\u002F)\n- [加入我们的 Discord 服务器](https:\u002F\u002Fdiscord.gg\u002FWaMp6PVPgR)\n- [向 Linux 基金会 AI&Data 捐赠的博客文章](https:\u002F\u002Fjina.ai\u002Fnews\u002Fdonate-docarray-lf-for-inclusive-standard-multimodal-data-model\u002F)\n- [路线图](https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fissues\u002F1714)\n\n> DocArray 是 LF AI Projects, LLC 的注册商标","# DocArray 快速上手指南\n\nDocArray 是一个专为多模态数据（如文本、图像、音频、视频等）设计的 Python 库。它提供了统一的数据结构，用于数据的表示、传输、存储和检索，能够无缝集成 PyTorch、TensorFlow、JAX 等主流深度学习框架以及 FastAPI、Jina 等微服务框架。\n\n## 环境准备\n\n*   **操作系统**：Linux, macOS, Windows\n*   **Python 版本**：建议 Python 3.8 或更高版本\n*   **前置依赖**：\n    *   本库基于 **Pydantic** 构建，安装时会自动处理依赖。\n    *   根据实际需求，建议预先安装深度学习框架（如 `torch`, `tensorflow`, `jax`）以启用完整的张量类型支持。\n\n## 安装步骤\n\n使用 pip 进行安装。国内开发者推荐使用清华源或阿里源以加速下载。\n\n```shell\n# 官方源安装\npip install -U docarray\n\n# 推荐：使用清华大学镜像源加速安装\npip install -U docarray -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n> **注意**：本文档适用于 DocArray v0.30+ 版本。如果您需要使用旧版 (v0.21)，请运行 
`pip install docarray==0.21`。\n\n## 基本使用\n\n### 1. 定义多模态数据结构\n\nDocArray 允许你像定义 Python 数据类一样定义数据结构，并直接指定张量（Tensor）的形状和多模态字段类型。\n\n```python\nfrom typing import Optional\n\nfrom docarray import BaseDoc\nfrom docarray.typing import TorchTensor, ImageUrl\nimport torch\n\n\n# 定义你的数据模型\nclass MyDocument(BaseDoc):\n    description: str\n    image_url: ImageUrl  # 支持 ImageUrl, VideoUrl, AudioUrl 等\n    # 可以直接指定张量的具体形状 [高，宽，通道]；声明为 Optional 以便创建后再填充\n    image_tensor: Optional[TorchTensor[1704, 2272, 3]] = None\n    embedding: Optional[TorchTensor[512]] = None\n```\n\n### 2. 创建与处理数据实例\n\n你可以轻松创建文档实例，并利用内置方法加载数据或计算嵌入向量。\n\n```python\n# 创建一个文档实例\ndoc = MyDocument(\n    description=\"这是一张山的照片\",\n    image_url=\"https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002F2\u002F2f\u002FAlpamayo.jpg\",\n)\n\n# 从 URL 加载图像张量\ndoc.image_tensor = doc.image_url.load()\n\n# 模拟一个编码函数（实际使用中可替换为 CLIP 等模型）\ndef clip_image_encoder(image_tensor: TorchTensor) -> TorchTensor:\n    return torch.rand(512)\n\n# 计算嵌入向量\ndoc.embedding = clip_image_encoder(doc.image_tensor)\n\nprint(doc.embedding.shape)  # 输出：torch.Size([512])\n```\n\n### 3. 
批量数据处理 (DocVec vs DocList)\n\n在机器学习场景中，通常需要批量处理数据。DocArray 提供两种核心容器：\n\n*   **`DocVec`**：将多个文档中的张量堆叠成一个大张量。**适用于模型训练和批量推理**，性能极高。\n*   **`DocList`**：保持文档列表形式，张量不堆叠。**适用于数据流式传输、重排序或动态增删**。\n\n#### 使用 DocVec 进行批量操作\n\n```python\nfrom docarray import DocVec, BaseDoc\nfrom docarray.typing import AnyTensor, ImageUrl\nimport numpy as np\n\nclass Image(BaseDoc):\n    url: ImageUrl\n    tensor: AnyTensor  # 兼容 torch, numpy, tensorflow\n\n# 初始化 DocVec，所有张量自动堆叠\nvec = DocVec[Image](\n    [\n        Image(\n            url=\"https:\u002F\u002Fexample.com\u002Fimage.jpg\",\n            tensor=np.zeros((3, 224, 224)),\n        )\n        for _ in range(100)  # 创建 100 个样本\n    ]\n)\n\n# 批量访问字段：直接获取堆叠后的大张量\nbatch_tensors = vec.tensor \nprint(batch_tensors.shape)  # 输出：(100, 3, 224, 224)\n```\n\n#### 灵活转换\n\n你可以根据需求在 `DocVec` 和 `DocList` 之间无缝切换：\n\n```python\nfrom docarray import DocList\n\n# 转换为列表模式（适合后续单独处理或追加数据）\ndl = vec.to_doc_list()\n\n# 也可以从列表转回向量模式\nvec_new = dl.to_doc_vec()\n```\n\n### 4. 数据序列化与传输\n\nDocArray 原生支持多种序列化格式，方便通过 HTTP (JSON) 或 gRPC (Protobuf) 传输数据。\n\n```python\n# 序列化为不同格式\nproto_bytes = doc.to_protobuf()\njson_str = doc.json()\nraw_bytes = doc.to_bytes()\n\n# 反序列化还原\ndoc_restored = MyDocument.from_protobuf(proto_bytes)\ndoc_from_json = MyDocument.parse_raw(json_str)\n```","某电商初创团队正在开发一个“以图搜同款”功能，需要同时处理商品图片、文本描述及价格标签等多模态数据，并将其存入向量数据库以供快速检索。\n\n### 没有 docarray 时\n- **数据结构混乱**：开发者需手动编写类来对齐图像张量（PyTorch\u002FTensorFlow）与文本元数据，字段类型检查全靠自觉，极易出错。\n- **转换代码冗余**：在模型推理、API 传输（JSON\u002FProtobuf）和数据库存储之间，需要反复编写繁琐的格式转换代码，维护成本极高。\n- **生态集成困难**：对接 Weaviate 或 Qdrant 等向量库时，缺乏统一接口，每次更换后端都要重构大量数据序列化逻辑。\n- **调试效率低下**：由于缺乏标准化的多模态数据结构，排查数据维度不匹配或类型错误往往耗费数小时。\n\n### 使用 docarray 后\n- **统一数据建模**：利用基于 Pydantic 的 `Document` 结构，一行代码即可定义包含图像张量、文本嵌入和标量的强类型对象，自动校验数据完整性。\n- **无缝流转互通**：docarray 原生支持将数据直接序列化为 JSON 过 HTTP 或 Protobuf 过 gRPC，并能直接与 PyTorch\u002FJAX 模型交互，消除了中间转换层。\n- **一键向量入库**：内置对 Weaviate、Qdrant 等主流向量库的支持，只需调用简单接口即可完成多模态数据的索引构建与相似度搜索。\n- 
**开发聚焦业务**：团队不再纠结于底层数据搬运，可将精力集中在优化检索算法和提升用户体验上，新功能上线周期缩短一半。\n\ndocarray 通过提供专为多模态 AI 设计的标准化数据结构，彻底打通了从模型训练到生产部署的数据链路，让复杂的多模态应用开发变得像操作普通 Python 对象一样简单。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fdocarray_docarray_02a5f7b1.png","DocArray","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fdocarray_a674d9c7.png","DocArray is an open source project offering the data structure for multimodal data. It is hosted in incubation in the LF AI & Data Foundation.",null,"info@lfaidata.foundation","https:\u002F\u002Fdocs.docarray.org","https:\u002F\u002Fgithub.com\u002Fdocarray",[81,85,89,93],{"name":82,"color":83,"percentage":84},"Python","#3572A5",99.6,{"name":86,"color":87,"percentage":88},"Shell","#89e051",0.4,{"name":90,"color":91,"percentage":92},"HTML","#e34c26",0,{"name":94,"color":95,"percentage":92},"CSS","#663399",3117,240,"2026-04-10T19:12:22","Apache-2.0",1,"未说明","非必需。支持多种后端（NumPy, PyTorch, TensorFlow, JAX），GPU 需求取决于用户选择的具体深度学习框架及模型训练场景，工具本身无强制显卡型号或显存要求。",{"notes":104,"python":101,"dependencies":105},"DocArray 是一个用于多模态数据表示、传输、存储和检索的 Python 库。它基于 Pydantic 构建，与 FastAPI 和 Jina 无缝集成。用户可根据需要选择安装不同的深度学习后端（如 PyTorch、TensorFlow、JAX 或仅使用 NumPy）。当前 README 主要针对 v0.30+ 版本，若需使用旧版 (\u003C=0.21) 需指定版本号安装。支持通过 HTTP\u002FJSON 或 gRPC\u002FProtobuf 进行数据传输，并兼容多种向量数据库（如 Weaviate, Qdrant, Redis 等）。",[106,107,108,109,110],"pydantic","numpy","torch (可选)","tensorflow (可选)","jax (可选)",[52,16,14,112],"其他",[64,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,106],"data-structures","multimodal","cross-modal","neural-search","deep-learning","nested-data","qdrant","weaviate","nearest-neighbor-search","protobuf","elasticsearch","dataclass","multi-modal","semantic-search","machine-learning","pytorch","fastapi","2026-03-27T02:49:30.150509","2026-04-16T01:43:14.506930",[134,139,144,149,154,159,164],{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},35149,"如何在 FastAPI 中为 DocArray 的 BaseDoc 生成 Swagger UI 文档？","Swagger UI 的文档实际上已经存在。您可以参考官方文档中的 'Send Over 
FastAPI' 章节，其中详细介绍了如何创建带有 Swagger UI 的 FastAPI 应用并与 BaseDoc 配合使用。文档地址：https:\u002F\u002Fdocs.docarray.org\u002Fuser_guide\u002Fsending\u002Fapi\u002FfastAPI\u002F","https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fissues\u002F1558",{"id":140,"question_zh":141,"answer_zh":142,"source_url":143},35150,"安装 docarray[common] 后仍然报错提示缺少 google.protobuf 库怎么办？","这是预期的行为，有时即使安装了 `docarray[common]`，protobuf 依赖也可能未正确解析。如果遇到问题，建议直接使用 `save_binary` 方法来保存数据，而不是使用 `push\u002Fpull file:\u002F\u002F` 方式，后者在某些环境下表现不稳定。","https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fissues\u002F1420",{"id":145,"question_zh":146,"answer_zh":147,"source_url":148},35151,"在使用 numpy==1.26.1 时从 DataFrame 反序列化数据遇到错误如何解决？","这是一个已知的边缘案例问题。在修复版本发布之前，您可以通过降级 numpy 版本来绕过此问题。维护者建议暂时使用较低版本的 numpy（例如 1.24.x 或 1.25.x）直到包含修复的新版本正式发布。","https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fissues\u002F1821",{"id":150,"question_zh":151,"answer_zh":152,"source_url":153},35152,"索引嵌套的 DocList 时出现 'RuntimeError: The number of elements exceeds the specified limit' 错误怎么办？","该问题已在 Pull Request #1602 中得到修复。请确保您将 docarray 升级到包含此修复的最新版本。如果问题仍然存在，请检查您的嵌套数据结构是否过于复杂或元素数量是否确实超过了系统限制。","https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fissues\u002F1585",{"id":155,"question_zh":156,"answer_zh":157,"source_url":158},35153,"BaseDoc 对象的 dict() 方法和内置的 dict(doc) 函数有什么区别？","两者行为不同：\n1. `doc.dict()`：返回包含所有字段值的字典，对于嵌套的 DocList 或复杂对象，会递归将其转换为字典格式。\n2. `dict(doc)`：当直接对 BaseDoc 对象使用时，它主要返回类属性设置的值。如果属性类型是 List、Dict、Tuple 或 DocList 等包含嵌套对象的类型，输出可能仅显示对象类型而非具体内容。\n建议优先使用 `doc.dict()` 来获取完整的结构化数据表示。","https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fissues\u002F1560",{"id":160,"question_zh":161,"answer_zh":162,"source_url":163},35154,"初始化大量的 Document 或 DocumentArray 时速度非常慢，有什么优化方法吗？","与原生 Python 列表相比，Document 和 DocumentArray 的初始化确实需要更多时间，因为它们涉及更多的元数据处理和验证。对于大规模数据（如百万级文档），建议：\n1. 避免在循环中逐个创建 Document 对象。\n2. 考虑批量处理或使用更高效的数据加载策略。\n3. 
如果不需要完整的 Document 功能，可以先用普通列表处理数据，最后再转换为 DocumentArray。","https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fissues\u002F1189",{"id":165,"question_zh":166,"answer_zh":167,"source_url":168},35155,"Weaviate 最小示例在调试模式下运行正常，但整体运行时失败，特别是在 M1 Mac 上？","这个问题在较新的版本升级中可能已经得到解决。如果您仍在使用旧版本并遇到此问题，请尝试升级到最新的 docarray 和 weaviate-client 版本。此外，确保 Weaviate 服务在本地正确运行（localhost:8080），并且防火墙或端口占用没有干扰连接。","https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fissues\u002F433",[170,175,180,185,190,195,200,205,210,215,220,225,230,235,240,245,250,255,260,265],{"id":171,"version":172,"summary_zh":173,"released_at":174},280151,"v0.40.1","## 发布说明（`0.40.1`）\n\n> 发布时间：2025-03-21 08:34:40\n\n\n\n🙇 我们要感谢所有为本次新版本做出贡献的开发者！特别感谢：\nJoan Fontanals、Emmanuel Ferdman、Casey Clements、YuXuan Tay、dependabot[bot]、James Brown、Jina Dev Bot、🙇\n\n\n### 🐞 错误修复\n\n - [[```d98acb71```](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fdocarray\u002Fcommit\u002Fd98acb716e0c336a817f65b62d428ab13cf8ac42)] __-__ 修复使用 Pydantic V2 时的 DocList 模式 (#1876) (*Joan Fontanals*)\n - [[```83ebef60```](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fdocarray\u002Fcommit\u002F83ebef6087e868517681e59877008f80f1e7f113)] __-__ 更新许可证位置 (#1911) (*Emmanuel Ferdman*)\n - [[```8f4ba7cd```](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fdocarray\u002Fcommit\u002F8f4ba7cdf177f3e4ecc838eef659496d6038af03)] __-__ 使用 Docker Compose (#1905) (*YuXuan Tay*)\n - [[```febbdc42```](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fdocarray\u002Fcommit\u002Ffebbdc4291c4af7ad2058d7feebf6a3169de93e9)] __-__ 修复动态 Document 创建中的浮点数问题 (#1877) (*Joan Fontanals*)\n - [[```7c1e18ef```](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fdocarray\u002Fcommit\u002F7c1e18ef01b09ef3d864b200248c875d0d9ced29)] __-__ 修复以迭代方式创建纯 Python 类的问题 (#1867) (*Joan Fontanals*)\n\n### 📗 文档\n\n - [[```e4665e91```](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fdocarray\u002Fcommit\u002Fe4665e91b37f97a4a18a80399431d624db8ca453)] __-__ 将关于模式的提示移至通用文档索引部分 (#1868) (*Joan 
Fontanals*)\n - [[```8da50c92```](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fdocarray\u002Fcommit\u002F8da50c927c24b981867650399f64d4930bd7c574)] __-__ 在 contributing.md 中添加代码审查相关内容 (#1853) (*Joan Fontanals*)\n\n### 🏁 单元测试与 CI\u002FCD\n\n - [[```a162a4b0```](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fdocarray\u002Fcommit\u002Fa162a4b09f4ad8e8c5c117c0c0101541af4c00a1)] __-__ 修复发布流程 (#1922) (*Joan Fontanals*)\n - [[```82d7cee7```](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fdocarray\u002Fcommit\u002F82d7cee71ccdd4d5874985aef0567631424b5bfd)] __-__ 修复部分 CI 问题 (#1893) (*Joan Fontanals*)\n - [[```791e4a04```](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fdocarray\u002Fcommit\u002F791e4a0473afe9d9bde87733074eef0ce217d198)] __-__ 更新发布流程 (#1869) (*Joan Fontanals*)\n - [[```aa15b9ef```](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fdocarray\u002Fcommit\u002Faa15b9eff2f5293849e83291d79bf519994c3503)] __-__ 添加许可证文件 (#1861) (*Joan Fontanals*)\n\n### 🍹 其他改进\n\n - [[```b5696b22```](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fdocarray\u002Fcommit\u002Fb5696b227161f087fa32834dcd6c2d212cf82c0e)] __-__ 修复 CI 中的 Poetry 问题 (#1921) (*Joan Fontanals*)\n - [[```d3358105```](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fdocarray\u002Fcommit\u002Fd3358105db645418c3cebfc6acb0f353127364aa)] __-__ 更新 pyproject 版本 (#1919) (*Joan Fontanals*)\n - [[```40cf2962```](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fdocarray\u002Fcommit\u002F40cf29622b29be1f32595e26876593bb5f1e03be)] __-__ MongoDB Atlas：仅需两行更改即可使我们的 CI 构建通过 (#1910) (*Casey Clements*)\n - [[```75e0033a```](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fdocarray\u002Fcommit\u002F75e0033a361a31280709899e94d6f5e14ff4b8ae)] __-__ __deps__: 将 setuptools 从 65.5.1 升级到 70.0.0 (#1899) (*dependabot[bot]*)\n - [[```75a743c9```](https:\u002F\u002Fgithub.com\u002Fjina-ai\u002Fdocarray\u002Fcommit\u002F75a743c99dc54","2025-03-21T08:41:24",{"id":176,"version":177,"summary_zh":178,"released_at":179},280152,"v0.40.0","## 
发布说明（`0.40.0`）\n\n> 发布时间：2023-12-22 12:12:15\n\n本次发布包含 1 项新功能、3 个问题修复和 2 处文档改进。\n\n## 🆕 新功能\n### 添加 Epsilla 连接器 (#1835)\n我们已将 [Epsilla](https:\u002F\u002Fepsilla.com\u002F) 集成到 DocArray 中。\n\n以下是一个简单的使用示例：\n\n```python\nimport numpy as np\nfrom docarray import BaseDoc\nfrom docarray.index import EpsillaDocumentIndex\nfrom docarray.typing import NdArray\nfrom pydantic import Field\n\n\nclass MyDoc(BaseDoc):\n    text: str\n    embedding: NdArray[10] = Field(is_embedding=True)\n\ndocs = [MyDoc(text=f'text {i}', embedding=np.random.rand(10)) for i in range(10)]\nquery = np.random.rand(10)\ndb = EpsillaDocumentIndex[MyDoc]()\ndb.index(docs)\nresults = db.find(query, limit=10)\n```\n\n在这个示例中，我们创建了一个同时包含文本和数值数据的文档类。然后，我们初始化了一个基于 Epsilla 的文档索引，并用它来索引我们的文档。最后，我们执行了一次搜索查询。\n\n## 🐞 问题修复\n### 修复 Python 3.12 中的类型提示错误 (#1840)\nDocArray 的类型提示现已支持 Python 3.12。\n\n### 修复复杂模式的序列化与反序列化问题 (#1836)\n在对包含字典和其他复杂结构的嵌套文档的 `protobuf` 文档进行序列化和反序列化时，曾出现过问题。\n\n### 修复 TorchTensor 类中的存储问题 (#1833)\n当对 `TorchTensor` 对象进行深拷贝时，如果其 `dtype` 不是 `float32`，就会出现 bug。该问题现已修复。\n\n## 📗 文档改进\n* 添加 Epsilla 集成指南 ([docs(epsilla): 添加 Epsilla 集成指南 #1838](https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fpull\u002F1838))\n* 修复文档中的签名提交命令 ([docs: 修复文档中的签名提交命令 #1834](https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fpull\u002F1834))\n\n## 🤟 贡献者\n我们感谢本次发布的所有贡献者：\n\n* Tony Yang (@tonyyanga )\n* Naymul Islam (@ai-naymul )\n* Ben Shaver (@bpshaver )\n* Joan Fontanals (@JoanFM )\n* 954 (@954-Ivory )","2023-12-22T12:12:56",{"id":181,"version":182,"summary_zh":183,"released_at":184},280153,"v0.39.1","## 发布说明 (`0.39.1`)\n\n> 发布时间：2023-10-23 08:56:38\n\n本次发布包含 2 个错误修复。\n\n## 🐞 错误修复\n\n### 使用 numpy==1.26.1 的 `from_dataframe` 方法 (#1823)\n最近对 numpy 的更新改变了一些版本控制语义，导致 DocArray 的 `from_dataframe()` 方法在数据框中包含 numpy 数组时出现中断。此问题现已修复。\n\n```python\nfrom docarray import BaseDoc, DocVec\nfrom docarray.typing import NdArray\n\n\nclass MyDoc(BaseDoc):\n    embedding: NdArray\n    text: str\n\nda = DocVec[MyDoc](\n    [\n        MyDoc(\n            embedding=[1, 2, 3, 4],\n            
text='hello',\n        ),\n        MyDoc(\n            embedding=[5, 6, 7, 8],\n            text='world',\n        ),\n    ],\n    tensor_type=NdArray,\n)\ndf_da = da.to_dataframe()\n# 此前会出错，现已修复\nda2 = DocVec[MyDoc].from_dataframe(df_da, tensor_type=NdArray)\n```\n\n### Python 3.9 中的类型处理 (#1823)\n从 Python 3.9 开始，`Optional.__args__` 并不总是可用，这会导致一些兼容性问题。现已通过使用 `typing.get_args` 辅助函数修复了该问题。\n\n## 🤟 贡献者\n\n我们感谢本次发布的所有贡献者：\n- Johannes Messner (@JohannesMessner )","2023-10-23T08:57:14",{"id":186,"version":187,"summary_zh":188,"released_at":189},280154,"v0.39.0","## 发布说明 (`0.39.0`)\n\n> 发布时间：2023-10-02 13:06:02\n\n本次发布包含4项新功能、8个问题修复以及7处文档改进。\n\n## 🆕 新功能\n### 支持 Pydantic v2 🚀 (#1652)  \n本次发布的最大亮点是**全面支持 Pydantic v2**！与此同时，我们**将继续支持 Pydantic v1**。\n\n如果您正在使用 Pydantic v2，需要将您的 DocArray 代码适配到新的 Pydantic API。请参阅他们的迁移指南[这里](https:\u002F\u002Fdocs.pydantic.dev\u002Flatest\u002Fmigration\u002F)。\n\nPydantic v2 的核心由 Rust 编写，为 DocArray 带来了显著的性能提升：**JSON 序列化速度提升了 240%**，并且对包含非原生类型（如 `TorchTensor`）的 `BaseDoc` 和 `DocList` 进行**验证的速度提升了 20%**。\n\n### 添加 `BaseDocWithoutId` (#1803)  \n默认情况下，`BaseDoc` 包含一个 `id` 字段。然而，如果您希望构建一个不需要该 `id` 字段的 API，这可能会带来问题。因此，**我们现在提供了一个名为 `BaseDocWithoutId` 的类，顾名思义，它就是去除了 `id` 字段的 `BaseDoc`**。\n\n请谨慎使用此文档类型；除非您确实需要移除 `id` 字段，否则仍应以 `BaseDoc` 作为基类。\n\n⚠️ **`BaseDocWithoutId` 不兼容 `DocIndex`** 或任何依赖向量数据库的功能。这是因为 `DocIndex` 需要 `id` 字段来存储和检索文档。\n\n## 💣 破坏性变更\n### 移除 Jina AI 云的 `push\u002Fpull` 功能 (#1791)  \nJina AI Cloud 即将停止服务。因此，我们将移除与 Jina AI 云相关的 `push\u002Fpull` 功能。\n\n## 🐞 问题修复\n### 修复 `DocList` 订阅错误  \n`DocList` 可以通过以下语法从 `BaseDoc` 中进行类型化定义：`DocList[MyDoc]()`。\n\n在本次版本中，我们修复了一个允许用户多次指定 `DocList` 类型的 bug。  \n例如，`DocList[MyDoc1][MyDoc2]` 将不再生效 (#1800)。\n\n此外，我们还修复了一个当用户传递错误类型的 `DocList` 时（如 `DocList[doc()]`）导致静默失败的 bug (#1794)。\n\n### Milvus 连接参数缺失 (#1802)  \n我们修复了一个小 bug，该 bug 会导致 Milvus 客户端的端口被错误地设置。\n\n## 📗 文档改进\n* 修复 Pydantic v2 的文档 ([docs: fix documentation for pydantic v2 
#1815](https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fpull\u002F1815))  \n* 为预定义的网格 3D 文档添加字段描述 ([docs: adding field descriptions to predefined mesh 3D document #1789](https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fpull\u002F1789))  \n* 为预定义的点云 3D 文档添加字段描述 ([docs: adding field descriptions to predefined point cloud 3D document #1792](https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fpull\u002F1792))  \n* 为预定义的视频文档添加字段描述 ([docs: adding field descriptions to predefined video document #1775](https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fpull\u002F1775))  \n* 为预定义的文本文档添加字段描述 ([docs: adding field descriptions to predefined text document #1770](https:\u002F\u002Fgithub.com\u002F","2023-10-02T13:06:43",{"id":191,"version":192,"summary_zh":193,"released_at":194},280155,"v0.38.0","## 发布说明（`0.38.0`）\n\n> 发布时间：2023-09-07 13:40:16\n\n本次发布包含 3 个错误修复和 4 项文档改进，其中包含 1 项破坏性变更。\n## 💥 破坏性变更\n### `DocList.to_json()` 和 `DocVec.to_json()` 返回类型的变更\n为了使 `to_json` 方法在不同类中保持一致，我们将 `DocList` 和 `DocVec` 中该方法的返回类型更改为 `str`。\n这意味着，如果您在应用程序中使用此方法，请确保更新代码库，以预期返回 `str` 而不是 `bytes`。\n\n## 🐞 错误修复\n\n### 使 `DocList.to_json()` 和 `DocVec.to_json()` 返回 `str` 而不是 `bytes` (#1769)\n本次发布更改了 `DocList.to_json()` 和 `DocVec.to_json()` 方法的返回类型，使其与 `BaseDoc.to_json()` 及其他 Pydantic 模型保持一致。发布后，这些方法将返回 `str` 类型的数据，而不是 `bytes`。\n💥 由于返回类型发生了变化，这被视为一项破坏性变更。\n### 在追加之前对 `reduce` 进行类型转换 (#1758)\n本次发布在 `reduce` 辅助函数内部引入了类型转换，在将其结果追加到最终结果之前，会对输入进行类型转换。这将使得能够对模式兼容但不完全相同的文档进行归约操作。\n### 在 `__annotations__` 中忽略文档属性，但在 `__fields__` 中保留 (#1777)\n本次发布修复了 `create_pure_python_type_model` 辅助函数中的一个问题。从本次发布开始，在创建类型时，仅会考虑类的 `__fields__` 中的属性。\n之前的这种行为会在用户在输入类中引入 `ClassVar` 时导致应用程序崩溃：\n```python\nclass MyDoc(BaseDoc):\n    endpoint: ClassVar[str] = \"my_endpoint\"\n    input_test: str = \"\"\n```\n\n```text\n    field_info = model.__fields__[field_name].field_info\nKeyError: 'endpoint'\n```\n\n特别感谢 @NarekA 提出该问题，并在 Jina 项目中贡献了修复方案，该方案随后被移植到 DocArray 中。\n\n## 📗 文档改进\n\n- 解释如何设置 
Document 配置 (#1773)\n- 添加针对 torch compile 的解决方法 (#1754)\n- 增加关于序列化动态创建的 Doc 类的注意事项 (#1763)\n- 改进 `filter_docs` 的 docstring (#1762)\n\n## 🤟 贡献者\n\n我们感谢本次发布的所有贡献者：\n- Sami Jaghouar (@samsja )\n- Johannes Messner (@JohannesMessner )\n- AlaeddineAbdessalem (@alaeddine-13 )\n- Joan Fontanals (@JoanFM )\n","2023-09-07T13:40:49",{"id":196,"version":197,"summary_zh":198,"released_at":199},280156,"v0.37.1","# 发布说明 v0.37.1\n\n本次发布包含 4 个错误修复和 1 处文档改进。\n\n\n## 🐞 错误修复\n\n### 放宽 `UpdateMixin` 中的模式检查 (#1755)\n此前 `UpdateMixin` 中的模式检查过于严格，在两个文档的模式相似但引用关系不完全一致的情况下，不允许进行更新。\n例如，当模式是动态生成的，且字段和字段类型都相同，但引用关系不同时，检查仍会返回 `False`，从而无法完成更新操作。\n本次发布放宽了该检查，改为仅检查模式中的字段是否相似即可。\n\n### 修复非类类型字段问题 (#1752)\n我们修复了一个问题：在使用 `QdrantDocumentIndex` 的模式中，如果包含非类类型字段，会导致 `TypeError`。\n现已通过在 `QdrantDocumentIndex` 的实现中将 `issubclass` 替换为 `safe_issubclass` 来解决此问题。\n\n### 修复双重嵌套模式下的动态类创建问题 (#1747)\n以下代码曾导致 `KeyError`：\n\n```python\nfrom docarray import BaseDoc\nfrom docarray.utils.create_dynamic_doc_class import create_base_doc_from_schema\n\nclass Nested2(BaseDoc):\n    value: str\n\nclass Nested1(BaseDoc):\n    nested: Nested2\n\nclass RootDoc(BaseDoc):\n    nested: Nested1\n\nnew_my_doc_cls = create_base_doc_from_schema(RootDoc.schema(), 'RootDoc')\n```\n\n现通过修改 `create_base_doc_from_schema` 函数，使其在递归调用过程中能够正确传播嵌套模式的全局定义，从而解决了该问题。\n\n### 修复自述文件测试 (#1746)\n\n## 📗 文档改进\n\n- 更新自述文件 (#1744)\n\n\n## 🤟 贡献者\n\n我们感谢本次发布的所有贡献者：\n- AlaeddineAbdessalem (@alaeddine-13)\n- Joan Fontanals (@JoanFM)\n- TERBOUCHE Hacene (@TerboucheHacene)\n- samsja (@samsja)","2023-08-22T14:10:28",{"id":201,"version":202,"summary_zh":203,"released_at":204},280157,"v0.37.0","## 发布说明 (`0.37.0`)\n\n> 发布时间：2023-08-03 03:11:16\n\n本次发布包含 6 项新功能、5 个问题修复、1 项性能优化以及 1 
项文档改进。\n\n## 🆕 新功能\n\n### Milvus 集成 (#1681)\n通过这项最新集成，您可以在自己的 DocArray 项目中充分利用 Milvus 的强大功能。以下是一个简单的使用示例：\n```python\nimport numpy as np\nfrom docarray import BaseDoc\nfrom docarray.index import MilvusDocumentIndex\nfrom docarray.typing import NdArray\nfrom pydantic import Field\n\n\nclass MyDoc(BaseDoc):\n    text: str\n    embedding: NdArray[10] = Field(is_embedding=True)\n\ndocs = [MyDoc(text=f'text {i}', embedding=np.random.rand(10)) for i in range(10)]\nquery = np.random.rand(10)\ndb = MilvusDocumentIndex[MyDoc]()\ndb.index(docs)\nresults = db.find(query, limit=10)\n```\n\n在这个示例中，我们创建了一个同时包含文本和数值数据的文档类。然后，初始化一个基于 Milvus 的文档索引，并用它来索引我们的文档。最后，执行一次搜索查询。\n\n**支持的功能**\n\n- **查找**：向量搜索，用于高效检索相似文档。\n- **过滤**：使用 Milvus 的过滤语法，基于文本和数值数据进行过滤。\n- **获取\u002F删除**：从索引中获取或删除特定文档。\n- **混合搜索**：结合查找和过滤功能，实现更精细的搜索。\n- **子索引**：对嵌套数据进行搜索。\n\n### 支持在 `HnswDocumentIndex` 中进行过滤 (#1718)\n借助我们最新的更新，您现在可以轻松地在 `HnswDocumentIndex` 中使用过滤功能，既可以作为独立操作，也可以与查询构建器结合，从而将其与向量搜索相结合。\n\n下面的代码展示了这一新功能的工作方式：\n\n```python\nimport numpy as np\n\nfrom docarray import BaseDoc, DocList\nfrom docarray.index import HnswDocumentIndex\nfrom docarray.typing import NdArray\n\n\nclass SimpleSchema(BaseDoc):\n    year: int\n    price: int\n    embedding: NdArray[128]\n\n\n# 创建虚拟文档。\ndocs = DocList[SimpleSchema](\n    SimpleSchema(year=2000 - i, price=i, embedding=np.random.rand(128))\n    for i in range(10)\n)\n\ndoc_index = HnswDocumentIndex[SimpleSchema](work_dir=\".\u002Ftmp_5\")\ndoc_index.index(docs)\n\n# 独立的过滤操作（year == 1995）\nfilter_query = {\"year\": {\"$eq\": 1995}}\nresults = doc_index.filter(filter_query)\n\n# 过滤与向量搜索结合\nhybrid_query = (\n    doc_index.build_query()  # 获取空的查询对象\n    .filter(filter_query={\"year\": {\"$gt\": 1994}})  # 预先过滤（year > 1994）\n    .find(\n        query=np.random.rand(128), search_field=\"embedding\"\n    )  # 添加向量相似性搜索\n    .filter(filter_query={\"price\": {\"$lte\": 3}})  # 后期过滤（price \u003C= 3）\n    .build()\n)\nresults = 
doc_index.execute_query(hybrid_query)\n```\n\n首先，我们创建并索引了一些虚拟文档。然后，我们以两种方式使用过滤功能。一种是单独使用，用于查找特定年份的文档；另一种则是与向量","2023-08-03T03:11:46",{"id":206,"version":207,"summary_zh":208,"released_at":209},280158,"v0.36.0","## 发布说明 (`0.36.0`)\n\n> 发布时间：2023-07-18 14:43:28\n\n本次发布包含2个新特性、5个问题修复、1项性能优化和1项文档改进。\n\n## 🆕 特性\n\n### JAX 集成 (#1646)\n现在你可以在 Docarray 中使用 JAX。我们为你的文档引入了 `JaxArray` 这一新的类型选项。`JaxArray` 确保 JAX 现在可以原生处理 Docarray 文档中的任何数组类数据。以下是其用法示例：\n```python\nfrom docarray import BaseDoc\nfrom docarray.typing import JaxArray\nimport jax.numpy as jnp\n\n\nclass MyDoc(BaseDoc):\n    arr: JaxArray\n    image_arr: JaxArray[3, 224, 224] # 用于形状为 (3, 224, 224) 的图像\n    square_crop: JaxArray[3, 'x', 'x'] # 用于任意正方形图像，无论尺寸如何\n    random_image: JaxArray[3, ...]  # 用于具有 3 个颜色通道且其他维度任意的图像\n```\n正如你所见，`JaxArray` 类型非常灵活，能够支持多种张量形状。\n\n#### 创建包含张量的文档\n创建包含张量的文档非常简单。以下是一个示例：\n```python\ndoc = MyDoc(\n    arr=jnp.zeros((128,)),\n    image_arr=jnp.zeros((3, 224, 224)),\n    square_crop=jnp.zeros((3, 64, 64)),\n    random_image=jnp.zeros((3, 128, 256)),\n)\n```\n\n\n### Redis 集成 (#1550)\n借助此最新集成，在你的 Docarray 项目中充分利用 Redis 的强大功能。以下是一个简单的使用示例：\n```python\nimport numpy as np\nfrom docarray import BaseDoc\nfrom docarray.index import RedisDocumentIndex\nfrom docarray.typing import NdArray\n\n\nclass MyDoc(BaseDoc):\n    text: str\n    embedding: NdArray[10]\n\ndocs = [MyDoc(text=f'text {i}', embedding=np.random.rand(10)) for i in range(10)]\nquery = np.random.rand(10)\ndb = RedisDocumentIndex[MyDoc](host='localhost')\ndb.index(docs)\nresults = db.find(query, search_field='embedding', limit=10)\n``` \n\n在这个示例中，我们创建了一个同时包含文本和数值数据的文档类。然后，我们初始化了一个基于 Redis 的文档索引，并用它来索引我们的文档。最后，我们执行了一次搜索查询。\n\n#### 支持的功能\n**查找**：向量搜索，用于高效检索相似文档。\n**过滤**：使用 Redis 语法，基于文本和数值数据进行过滤。\n**文本搜索**：利用 BM25 等文本搜索方法，查找相关文档。\n**获取\u002F删除**：从索引中获取或删除特定文档。\n**混合搜索**：结合查找和过滤功能，实现更精细的搜索。目前仅支持这两种功能的组合。\n**子索引**：对嵌套数据进行搜索。\n\n\n## 🚀 性能\n\n\n### 
通过缓存文档数量加速 `HnswDocumentIndex` (#1706)\n我们通过缓存文档数量优化了 `num_docs()` 操作，解决了之前在搜索过程中出现的速度下降问题。这一改动带来了轻微的 i","2023-07-18T14:44:06",{"id":211,"version":212,"summary_zh":213,"released_at":214},280159,"v0.35.0","# 发布说明 (`0.35.0`)\n\n本次发布包含 3 项新功能、2 个错误修复和 1 处文档改进。\n\n## 🆕 功能\n\n### `DocVec` 的更多序列化选项 (#1562)\n\n`DocVec` 现在拥有 **与 `DocList` 相同的序列化接口**。这意味着它现在支持以下方法：\n\n- `to_protobuf()`\u002F`from_protobuf()`\n- `to_base64()`\u002F`from_base64()`\n- `save_binary()`\u002F`load_binary()`\n- `to_bytes()`\u002F`from_bytes()`\n- `to_dataframe()`\u002F`from_dataframe()`\n\n例如，现在可以这样进行 Base64 (反)序列化：\n\n```python\nfrom docarray import BaseDoc, DocVec\n\nclass SimpleDoc(BaseDoc):\n    text: str\n\ndv = DocVec[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])\nbase64_repr_dv = dv.to_base64(compress=None, protocol='pickle')\n\ndl_from_base64 = DocVec[SimpleDoc].from_base64(\n    base64_repr_dv, compress=None, protocol='pickle'\n)\n```\n\n有关更多信息，请参阅 [序列化相关文档](https:\u002F\u002Fdocs.docarray.org\u002Fuser_guide\u002Fsending\u002Fserialization\u002F)。\n\n### 验证 URL 中的文件格式 (#1606) (#1669)\n\n对 `AudioURL`、`TextURL` 和 `ImageURL` 等类型中提供的文件格式进行验证，以确保其与预期的 MIME 类型相符。\n\n### 添加从模式创建 `BaseDoc` 的方法 (#1667)\n\n有时，根据原始 `BaseDoc` 的模式动态创建一个新的 `BaseDoc` 会很有用。通过使用 `create_pure_python_type_model` 和 `create_base_doc_from_schema` 方法，可以确保正确地重建 `BaseDoc`。\n\n```python\nfrom docarray.utils.create_dynamic_doc_class import (\n    create_base_doc_from_schema,\n    create_pure_python_type_model,\n)\n\nfrom typing import Optional\nfrom docarray import BaseDoc, DocList\nfrom docarray.typing import AnyTensor\nfrom docarray.documents import TextDoc\n\nclass MyDoc(BaseDoc):\n    tensor: Optional[AnyTensor]\n    texts: DocList[TextDoc]\n\nMyDocPurePython = create_pure_python_type_model(MyDoc) # 由于 DocList 作为 Pydantic List 的限制，我们需要将 MyDoc 的 DocList 转换为 List。\nNewMyDoc = create_base_doc_from_schema(\n    MyDocPurePython.schema(), 'MyDoc', {}\n)\n\nnew_doc = NewMyDoc(tensor=None, 
texts=[TextDoc(text='text')])\n```\n\n## 🐞 错误修复\n\n### 限制 Pydantic 版本 (#1682)\n由于 `Pydantic v2` 的重大变更，我们已将版本上限设置为避免安装 `docarray` 时出现问题。\n\n### 改进 `DocVec` 不可用时的错误信息 (#1675)\n\n在调用 `doc_list = doc_vec.to_doc_list()` 后，`doc_vec` 的数据会被转移至 `doc_list`，从而使其进入不可用状态。此次修复为用户提供了更清晰的错误提示，当他们尝试在 `doc_vec` 已经不可用后与其交互时。\n\n## 📗 文档改进\n\n- 修复 README 中的引用 (#1674)\n\n## 🤟 贡献者\n\n我们感谢本次发布的所有贡献者：\n- Saba Sturua (@jupyterjazz )\n- Joan Fontanals (@JoanFM )\n- Han Xiao (@hanxiao )\n- Johannes Mess","2023-07-03T11:53:55",{"id":216,"version":217,"summary_zh":218,"released_at":219},280160,"v0.21.1","## 发布说明 (`0.21.1`)\n\n> 发布时间：2023-06-21 08:15:43\n\n本次发布包含 1 个 bug 修复。\n\n## 🐞 Bug 修复\n\n### 允许向 WeaviateDocumentArray 传递额外的请求头 (#1673)\n\n这些额外的请求头可用于传递认证密钥，以连接到受保护的 Weaviate 实例。\n\nWeaviateDocumentArray 现已支持\n## 🤟 贡献者\n\n我们感谢本次发布的所有贡献者：\n- Girish Chandrashekar (@girishc13)\n\n","2023-06-26T15:51:02",{"id":221,"version":222,"summary_zh":223,"released_at":224},280161,"v0.34.0","## Release Note (`0.34.0`)\r\n\r\n> Release time: 2023-06-21 08:15:43\r\n\r\nThis release contains 2 breaking changes, 3 new features, 11 bug fixes, and 2 documentation improvements.\r\n\r\n## :bomb: Breaking Changes\r\n\r\n### Terminate Python 3.7 support\r\n\r\n:warning: :warning: DocArray will now require Python 3.8. We can no longer assure compatibility with Python 3.7.\r\n\r\nWe decided to drop it for two reasons:\r\n\r\n* Several dependencies of DocArray require Python 3.8.\r\n* Python [long-term support for 3.7 is ending](https:\u002F\u002Fendoflife.date\u002Fpython) this week. 
This means there will no longer \r\nbe security updates for Python 3.7, making this a good time for us to change our requirements.\r\n\r\n### Changes to `DocVec` Protobuf definition (#1639)\r\n\r\nIn order to fix a bug in the `DocVec` protobuf serialization described in [#1561](https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fissues\u002F1561),\r\nwe have changed the `DocVec` .proto definition.\r\n\r\nThis means that **`DocVec` objects serialized with DocArray v0.33.0 or earlier cannot be deserialized with DocArray\r\nv.0.34.0 or later, and vice versa**.\r\n\r\n:warning: :warning: **We strongly recommend** that everyone using Protobuf with `DocVec` upgrade to DocArray v0.34.0 or \r\nlater.\r\n\r\n## 🆕 Features\r\n\r\n\r\n### Allow users to check if a Document is already indexed in a DocIndex (#1633)\r\nYou can now check if a Document has already been indexed by using the `in` keyword:\r\n\r\n```python\r\nfrom docarray.index import InMemoryExactNNIndex\r\nfrom docarray import BaseDoc, DocList\r\nfrom docarray.typing import NdArray\r\nimport numpy as np\r\n\r\nclass MyDoc(BaseDoc):\r\n    text: str\r\n    embedding: NdArray[128]\r\n\r\ndocs = DocList[MyDoc](\r\n        [MyDoc(text=\"Example text\", embedding=np.random.rand(128))\r\n         for _ in range(2000)])\r\n\r\nindex = InMemoryExactNNIndex[MyDoc](docs)\r\nassert docs[0] in index\r\nassert MyDoc(text='New text', embedding=np.random.rand(128)) not in index\r\n```\r\n\r\n### Support subindexes in `InMemoryExactNNIndex` (#1617)\r\n\r\nYou can now use the [find_subindex](https:\u002F\u002Fdocs.docarray.org\u002Fuser_guide\u002Fstoring\u002Fdocindex\u002F#nested-data-with-subindex) \r\nmethod with the ExactNNSearch DocIndex.\r\n```python\r\nfrom docarray.index import InMemoryExactNNIndex\r\nfrom docarray import BaseDoc, DocList\r\nfrom docarray.typing import ImageUrl, VideoUrl, AnyTensor\r\n\r\nclass ImageDoc(BaseDoc):\r\n    url: ImageUrl\r\n    tensor_image: AnyTensor = Field(space='cosine', 
dim=64)\r\n\r\n\r\nclass VideoDoc(BaseDoc):\r\n    url: VideoUrl\r\n    images: DocList[ImageDoc]\r\n    tensor_video: AnyTensor = Field(space='cosine', dim=128)\r\n\r\n\r\nclass MyDoc(BaseDoc):\r\n    docs: DocList[VideoDoc]\r\n    tensor: AnyTensor = Field(space='cosine', dim=256)\r\n\r\ndoc_index = InMemoryExactNNIndex[MyDoc]()\r\n...\r\n\r\n# find by the `ImageDoc` tensor when index is populated\r\nroot_docs, sub_docs, scores = doc_index.find_subindex(\r\n    np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3\r\n)\r\n```\r\n\r\n### Flexible tensor types for protobuf deserialization (#1645)\r\n\r\nYou can deserialize any `DocVec` protobuf message to any tensor type\r\nby passing the `tensor_type` parameter to `from_protobuf`.\r\n\r\nThis means that you can choose at deserialization time if you are working with NumPy, PyTorch, or TensorFlow tensors.\r\n\r\n```python\r\nfrom docarray import BaseDoc, DocVec\r\nfrom docarray.typing import TensorFlowTensor\r\n\r\nclass MyDoc(BaseDoc):\r\n    tensor: TensorFlowTensor\r\n\r\nda = DocVec[MyDoc](...)  # doesn't matter what tensor_type is here\r\n\r\nproto = da.to_protobuf()\r\nda_after = DocVec[MyDoc].from_protobuf(proto, tensor_type=TensorFlowTensor)\r\n\r\nassert isinstance(da_after.tensor, TensorFlowTensor)\r\n```\r\n\r\n## ⚙ Refactoring\r\n\r\n### Add `DBConfig` to `InMemoryExactNNIndex`\r\n`InMemoryExactNNIndex` used to take a single constructor parameter, `index_file_path`, unlike the rest of \r\nthe Indexers, which accepted their own `DBConfig`. 
Now `index_file_path` is part of the `DBConfig`, which allows \r\ninitialization from it.\r\nThis will allow us to extend this config if more parameters are needed.\r\n\r\nThe parameters of `DBConfig` can be passed at construction time as `**kwargs`, making this change compatible with old \r\nusage.\r\n\r\nThese two initializations are equivalent:\r\n```python\r\nfrom docarray.index import InMemoryExactNNIndex\r\ndb_config = InMemoryExactNNIndex.DBConfig(index_file_path='index.bin')\r\n\r\nindex = InMemoryExactNNIndex[MyDoc](db_config=db_config)\r\nindex = InMemoryExactNNIndex[MyDoc](index_file_path='index.bin')\r\n```\r\n\r\n## 🐞 Bug Fixes\r\n\r\n### Allow protobuf deserialization of `BaseDoc` with `Union` type (#1655)\r\n\r\nSerialization of `BaseDoc` types that have a `Union` field of Python native types is now supported:\r\n\r\n```python\r\nfrom docarray import BaseDoc, DocList\r\nfrom typing import Union\r\n\r\nclass MyDoc(BaseDoc):\r\n    union_field: Union[int, str]\r\n\r\ndocs1 = DocList[MyDoc]([MyDoc(union_field=\"hello\")])\r\ndocs2 = DocList[MyDoc].from_dataframe(docs1.to_dataframe())\r\nassert docs1 == docs2\r\n```\r\n\r\nWhen these `Union` types involve other `BaseDoc` types, an exception is thrown.\r\n\r\n```python\r\nclass CustomDoc(BaseDoc):\r\n    ud: Union[TextDoc, ImageDoc] = TextDoc(t
TorchTensor\r\n\r\n\r\ndoc = MyTensorsDoc(tensor=np.zeros(512))\r\ndoc.summary()\r\n```\r\n\r\n```\r\n📄 MyTensorsDoc : 0a10f88 ...\r\n╭─────────────────────┬────────────────────────────────────────────────────────╮\r\n│ Attribute           │ Value                                                  │\r\n├─────────────────────┼────────────────────────────────────────────────────────┤\r\n│ tensor: TorchTensor │ TorchTensor of shape (512,), dtype: torch.float64      │\r\n╰─────────────────────┴────────────────────────────────────────────────────────╯\r\n```\r\n\r\n## 🚀 Performance\r\n\r\n### Avoid stack embedding for every search (#1586)\r\n\r\nWe have made a performance improvement for the `find` interface for `InMemoryExactNNIndex` that gives a ~2x speedup.\r\n\r\nThe script used to measure this is as follows:\r\n\r\n```python\r\nfrom torch import rand\r\nfrom time import perf_counter\r\n\r\nfrom docarray import BaseDoc, DocList\r\nfrom docarray.index.backends.in_memory import InMemoryExactNNIndex\r\nfrom docarray.typing import TorchTensor\r\n\r\n\r\nclass MyDocument(BaseDoc):\r\n    embedding: TorchTensor\r\n    embedding2: TorchTensor\r\n    embedding3: TorchTensor\r\n\r\ndef generate_doc_list(num_docs: int, dims: int) -> DocList[MyDocument]:\r\n    return DocList[MyDocument](\r\n        [\r\n            MyDocument(\r\n                embedding=rand(dims),\r\n                embedding2=rand(dims),\r\n                embedding3=rand(dims),\r\n            )\r\n            for _ in range(num_docs)\r\n        ]\r\n    )\r\n\r\nnum_docs, num_queries, dims = 500000, 1000, 128\r\ndata_list = generate_doc_list(num_docs, dims)\r\nqueries = generate_doc_list(num_queries, dims)\r\n\r\nindex = InMemoryExactNNIndex[MyDocument](data_list)\r\n\r\nstart = perf_counter()\r\nfor _ in range(5):\r\n    matches, scores = index.find_batched(queries, search_field='embedding')\r\n\r\nprint(f\"Number of queries: {num_queries} \\n\"\r\n      f\"Number of indexed documents: {num_docs} 
\\n\"\r\n      f\"Total time: {(perf_counter() - start)\u002F5} seconds\")\r\n\r\n```\r\n\r\n## 🐞 Bug Fixes\r\n\r\n### Respect `limit` parameter in `filter` for index backends (#1618)\r\n\r\n`InMemoryExactNNIndex` and `HnswDocumentIndex` now respect the `limit` parameter in the `filter` API.\r\n\r\n### `HnswDocumentIndex` can search with `limit` greater than the number of documents (#1611)\r\n\r\n`HnswDocumentIndex` now allows calling `find` with a `limit` parameter larger than the number of indexed documents.\r\n\r\n### Allow updating `HnswDocumentIndex` (#1604)\r\n\r\n`HnswDocumentIndex` now allows reindexing documents with the same `id`, updating the original documents.\r\n\r\n### Dynamically resize internal index to adapt to increasing number of documents (#1602)\r\n\r\n`HnswDocumentIndex` now allows indexing more than `max_elements`, dynamically adapting the index as it grows.\r\n\r\n### Fix simple usage of `HnswDocumentIndex` (#1596)\r\n\r\n```python\r\nfrom docarray.index import HnswDocumentIndex\r\nfrom docarray import DocList, BaseDoc\r\nfrom docarray.typing import NdArray\r\nimport numpy as np\r\n\r\nclass MyDoc(BaseDoc):\r\n    text: str\r\n    embedding: NdArray[128]\r\n\r\ndocs = [MyDoc(text='hey', embedding=np.random.rand(128)) for _ in range(200)]\r\nindex = HnswDocumentIndex[MyDoc](work_dir='.\u002Ftmp', index_name='index')\r\nindex.index(docs=DocList[MyDoc](docs))\r\nresp = index.find_batched(queries=DocList[MyDoc](docs[0:3]), search_field='embedding')\r\n\r\n```\r\n\r\nPreviously, this basic usage threw an exception:\r\n\r\n```\r\nTypeError: ModelMetaclass object argument after ** must be a mapping, not MyDoc\r\n```\r\n\r\nNow, it works as expected.\r\n\r\n### Fix `InMemoryExactNNIndex` index initialization with nested `DocList` (#1582)\r\n\r\nInstantiating an `InMemoryExactNNIndex` with a `Document` schema that had a nested `DocList` previously threw this error:\r\n\r\n```python\r\nfrom docarray import BaseDoc, DocList\r\nfrom docarray.documents import 
TextDoc\r\nfrom docarray.index import HnswDocumentIndex\r\n\r\nclass MyDoc(BaseDoc):\r\n    text: str\r\n    d_list: DocList[TextDoc]\r\n\r\nindex = HnswDocumentIndex[MyDoc]()\r\n```\r\n\r\n```\r\nTypeError: docarray.index.abstract.BaseDocIndex.__init__() got multiple values for keyword argument 'db_config'\r\n```\r\n\r\nNow it can be successfully instantiated.\r\n\r\n### Fix summary of document with list (#1595)\r\n\r\nCalling `summary` on a document with a `List` attribute previously showed the wrong type:\r\n\r\n```python\r\nfrom docarray import BaseDoc, DocList\r\nfrom typing import List\r\n\r\nclass TestDoc(BaseDoc):\r\n    str_list: List[str]\r\n\r\ndl = DocList[TestDoc]([TestDoc(str_list=[]), TestDoc(str_list=[\"1\"])])\r\ndl.summary()\r\n```\r\nPrevious output:\r\n```\r\n╭─────── D
The index will not consider these `None` embeddings when running a search.\r\n\r\n```python\r\nimport torch\r\nfrom typing import Optional\r\n\r\nfrom docarray import BaseDoc, DocList\r\nfrom docarray.typing import TorchTensor\r\nfrom docarray.index import InMemoryExactNNIndex\r\n\r\n\r\nclass EmbeddingDoc(BaseDoc):\r\n    embedding: Optional[TorchTensor[768]]\r\n\r\ndocs = DocList[EmbeddingDoc](\r\n    [EmbeddingDoc(embedding=(torch.rand(768) if i % 2 else None)) for i in range(5)]\r\n)\r\nindex = InMemoryExactNNIndex[EmbeddingDoc](docs)\r\nindex.find(torch.rand((768,)), search_field=\"embedding\", limit=3)\r\n```\r\n\r\n### Safe `issubclass` check (#1569)\r\n\r\nIn DocArray, especially when dealing with indexers, field types are checked, which leads to calls to Python's `issubclass` function.\r\nThis call fails under some circumstances, for instance when called on a `List` or `Tuple` annotation. Starting with this release, we use a safe version that does not fail for these cases.\r\n\r\nThis enables the following usage, which would otherwise fail:\r\n\r\n```python\r\nfrom typing import List\r\n\r\nfrom docarray import BaseDoc\r\nfrom docarray.index import HnswDocumentIndex\r\n\r\nclass MyDoc(BaseDoc):\r\n    test: List[str]\r\n\r\nindex = HnswDocumentIndex[MyDoc]()\r\n```\r\n\r\n### Fix `AnyDoc` deserialization (#1571)\r\n\r\n`AnyDoc` is a schema-less special Document that adapts to the schema of the data it tries to load. However, in cases where the data contained Dictionaries or Lists, deserialization failed. 
This is now fixed and you can have this behavior:\r\n\r\n```python\r\nfrom docarray.base_doc import AnyDoc, BaseDoc\r\nfrom typing import Dict\r\n\r\nclass ConcreteDoc(BaseDoc):\r\n    text: str\r\n    tags: Dict[str, int]\r\n\r\ndoc = ConcreteDoc(text='text', tags={'type': 1})\r\n\r\nany_doc = AnyDoc.from_protobuf(doc.to_protobuf())\r\nassert any_doc.text == 'text'\r\nassert any_doc.tags == {'type': 1}\r\n```\r\n\r\n### `dict` method for Document view (#1559)\r\n\r\nPrior to this fix, `doc.dict()` would return an empty Dictionary if `doc.is_view() == True`:\r\n\r\n```python\r\nclass MyDoc(BaseDoc):\r\n    foo: int\r\n\r\nvec = DocVec[MyDoc]([MyDoc(foo=3)])\r\n# before\r\ndoc = vec[0]\r\nassert doc.is_view()\r\nprint(doc.dict())\r\n# > {}\r\n\r\n# after\r\ndoc = vec[0]\r\nassert doc.is_view()\r\nprint(doc.dict())\r\n# > {'id': 'f285db406a949a7e7ab084032800f7d8', 'foo': 3}\r\n```\r\n\r\n## 📗 Documentation Improvements\r\n\r\n- Update doc building guide (#1566)\r\n- Explain the state of `DocList` in FastAPI (#1546)\r\n\r\n## 🤟 Contributors\r\n\r\nWe would like to thank all contributors to this release:\r\n- aman-exp-infy (@agaraman0)\r\n- Johannes Messner (@JohannesMessner)\r\n- Joan Fontanals (@JoanFM)\r\n- Saba Sturua (@jupyterjazz)\r\n- Ge Jin (@maxwelljin)","2023-05-26T14:51:11",{"id":236,"version":237,"summary_zh":238,"released_at":239},280164,"v0.32.0","# Release Note (`v0.32.0`)\r\n\r\nThis release contains 4 new features, 0 performance improvements, 5 bug fixes and 4 documentation improvements.\r\n\r\n## 🆕 Features\r\n\r\n### Subindex for document index (#1428)\r\n\r\nThe subindex feature allows you to index documents that contain another `DocList` by automatically creating a separate collection\u002Findex for each such `DocList`:\r\n\r\n```python\r\n# create nested document schema\r\nclass SimpleDoc(BaseDoc):\r\n    tensor: NdArray[10]\r\n    text: str\r\n\r\n\r\nclass MyDoc(BaseDoc):\r\n    docs: DocList[SimpleDoc]\r\n\r\n\r\n# create some docs\r\nmy_docs = 
[\r\n    MyDoc(\r\n        docs=DocList[SimpleDoc](\r\n            [\r\n                SimpleDoc(\r\n                    tensor=np.ones(10) * (j + 1),\r\n                    text=f\"hello {j}\",\r\n                )\r\n                for j in range(10)\r\n            ]\r\n        ),\r\n    )\r\n]\r\n\r\n# index them into Elasticsearch\r\nindex = ElasticDocIndex[MyDoc](index_name=\"idx\")\r\nindex.index(my_docs)  # index with name 'idx' and 'idx__docs' will be generated\r\n\r\n# search on the nested level (subindex)\r\nquery = np.random.rand(10)\r\nmatches_root, matches_nested, scores = index.find_subindex(\r\n    query, search_field=\"docs__tensor\", limit=5\r\n)\r\n```\r\n\r\n### OpenAPI and FastAPI tensor shapes (#1510)\r\nWe have enabled shaped tensors to be properly represented in OpenAPI\u002FSwaggerUI, both in examples and the schema.\r\n\r\nThis means that you can now build web APIs using FastAPI where the SwaggerUI properly communicates tensor shapes to your users:\r\n\r\n```python\r\nclass Doc(BaseDoc):\r\n    embedding_torch: TorchTensor[3, 4]\r\n\r\n\r\napp = FastAPI()\r\n\r\n\r\n@app.post(\"\u002Ffoo\", response_model=Doc, response_class=DocArrayResponse)\r\nasync def foo(doc: Doc) -> Doc:\r\n    return Doc(embedding_torch=doc.embedding_torch)\r\n```\r\n\r\nGenerated Swagger UI:\r\n  \r\n![image](https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fassets\u002F44071807\u002F54acef18-483d-44a3-ad93-914563658ee2)\r\n![image](https:\u002F\u002Fgithub.com\u002Fdocarray\u002Fdocarray\u002Fassets\u002F44071807\u002Fe65d0105-7258-4a1d-a2c2-99bd198f8117)\r\n\r\n\r\n### Save and load inmemory index (#1534)\r\nWe added a `persist` method to the `InMemoryExactNNIndex` class to save the index to disk.\r\n```python\r\n# Save your existing index as a binary file\r\ndoc_index.persist('docs.bin')\r\n# Initialize a new document index using the saved binary file\r\nnew_doc_index = InMemoryExactNNIndex[MyDoc](index_file_path='docs.bin')\r\n```\r\n\r\n## 🐞 Bug 
Fixes\r\n\r\n### `search_field`  should be optional in hybrid text search (#1516)\r\nWe have added a sane default to `text_search()`  for the `search_field` argument that is now Optional.\r\n\r\n### Check if file path exists for in-memory index (#1537)\r\nWe have added an internal check to see if `index_file_path` exists when passed to `InMemoryExactNNIndex`.\r\n\r\n### Add empty judgement to index search (#1533)\r\nWe have ensured that empty indices do not fail when `find` is called.\r\n\r\n### Detach torch tensors (#1526)\r\nSerializing tensors with gradients no longer fails.\r\n\r\n### `Docvec` display (#1522)\r\n`Docvec` display issues have been resolved.\r\n\r\n## 📗 Documentation Improvements\r\n\r\n- Remove erroneous info (#1531)\r\n- Fix link to documentation in readme (#1525)\r\n- Flatten structure (#1520)\r\n- Fix links (#1518)\r\n\r\n## 🤟 Contributors\r\n\r\nWe would like to thank all contributors to this release:\r\n- Mohammad Kalim Akram (@makram93)\r\n- Johannes Messner (@JohannesMessner)\r\n- Anne Yang (@AnneYang720)\r\n- Zhaofeng Miao (@mapleeit)\r\n- Joan Fontanals (@JoanFM)\r\n- Kacper Łukawski (@kacperlukawski)\r\n- IyadhKhalfallah (@IyadhKhalfallah)\r\n- Saba Sturua (@jupyterjazz)","2023-05-16T11:30:13",{"id":241,"version":242,"summary_zh":243,"released_at":244},280165,"v0.31.1","## Release Note (`0.31.1`)\r\n\r\nThis patch release fixes a small bug that was introduced in the latest minor release (`0.31.0`).\r\n\r\n## 🐞 Bug Fixes\r\n\r\n\r\n- Calling `json` or `dict` on a Optional nested DocList does not throw an error anymore if the value is set to `None` (#1512) \r\n\r\n## 🤟 Contributors\r\n\r\nWe would like to thank all contributors to this release:\r\n\r\n- samsja (@samsja)\r\n\r\n\r\n","2023-05-08T16:31:28",{"id":246,"version":247,"summary_zh":248,"released_at":249},280166,"v0.31.0","# Release Note (`v0.31.0`)\r\n\r\nThis release contains 4 new features, 11 bug fixes, and several documentation improvements.\r\n\r\n## 💥 Breaking 
changes\r\n\r\n### Return type of `DocVec` Optional Tensor (#1472)\r\n\r\nOptional tensor fields in a `DocVec` will return `None` instead of a list of `NaN`s if the column does not hold any tensor.\r\n\r\nThis code snippet shows the breaking change:\r\n\r\n```python\r\nfrom typing import Optional\r\n\r\nfrom docarray import BaseDoc, DocVec\r\nfrom docarray.typing import NdArray\r\n\r\nclass MyDoc(BaseDoc):\r\n    tensor: Optional[NdArray[10]]\r\n\r\ndocs = DocVec[MyDoc]([MyDoc() for j in range(2)])\r\n\r\nprint(docs.tensor)\r\n```\r\n\r\n| Version | Return type |\r\n| --- | --- |\r\n| 0.30.0 | `[nan nan]` |\r\n| 0.31.0 | `None` |\r\n\r\n### Default index collection names\r\n\r\nMost vector databases have a concept similar to a 'table' in a relational database; this concept is usually called 'collection', 'index', 'class' or similar.\r\n\r\nIn DocArray v0.30.0, every Document Index backend defined its own default name for this, i.e. a default `index_name` or `collection_name`.\r\n\r\nStarting with DocArray v0.31.0, the default `index_name`\u002F`collection_name` will be derived from the document schema name:\r\n\r\n```python\r\nfrom docarray.index.backends.weaviate import WeaviateDocumentIndex\r\nfrom docarray import BaseDoc\r\n\r\nclass MyDoc(BaseDoc):\r\n    pass\r\n\r\n# With v0.30.0, the line below defaults to `index_name='Document'`.\r\n# This was the default regardless of the Document Index schema.\r\n# With v0.31.0, the line below defaults to `index_name='MyDoc'`\r\n# The default now depends on the schema, i.e. 
the `MyDoc` class.\r\nstore = WeaviateDocumentIndex[MyDoc]()\r\n```\r\n\r\n**If you create and persist a Document Index with v0.30.0, and try to access it using v0.31.0 without manually specifying an index name, an Exception will occur.**\r\n\r\nYou can fix this by manually specifying the index name to match the old default:\r\n\r\n```python\r\n# Create new Document Index using v0.30.0\r\nstore = WeaviateDocumentIndex[MyDoc](host=..., port=...)\r\n# Access it using v0.31.0\r\nstore = WeaviateDocumentIndex[MyDoc](host=..., port=..., index_name='Document')\r\n```\r\n\r\nThe below table summarizes the change for all DB backends:\r\n\r\n| | DBConfig argument | Default in v0.30.0 | Default in v0.31.0 |\r\n| --- | --- | --- | --- |\r\n| WeaviateDocumentIndex | `index_name` | 'Document' | Schema class name |\r\n| QdrantDocumentIndex | `collection_name` | 'documents' | Schema class name |\r\n| ElasticDocIndex | `index_name` | 'index__' + a random id | Schema class name |\r\n| ElasticV7DocIndex | `index_name` | 'index__' + a random id | Schema class name |\r\n| HnswDocumentIndex | n\u002Fa | n\u002Fa | n\u002Fa |\r\n\r\n\r\n## 🆕 Features\r\n\r\n### Add `InMemoryExactNNIndex` (#1441)\r\n\r\nIn this version we have introduced the `InMemoryExactNNIndex` Document Index, which allows you to perform in-memory exact vector search (as opposed to approximate nearest neighbor search in vector databases). 
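Conceptually, exact nearest-neighbor search is just a brute-force scan: score the query against every stored vector and keep the best hits. A minimal pure-Python sketch of that idea (an illustration only, not docarray's actual implementation; all names here are hypothetical):

```python
from math import sqrt


def cosine_sim(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def exact_nn_find(index, query, limit=3):
    # Brute force: score every stored vector, sort, keep the best `limit`.
    scored = sorted(
        ((cosine_sim(vec, query), doc_id) for doc_id, vec in index), reverse=True
    )
    return scored[:limit]


# Toy index of three 2-d "embeddings", keyed by document id.
docs = [(0, [1.0, 0.0]), (1, [0.0, 1.0]), (2, [1.0, 1.0])]
top = exact_nn_find(docs, [1.0, 0.1], limit=2)
```

This is exact (every candidate is scored), which is why it is a good fit for small collections but is outpaced by approximate indexes like HNSW at larger scales.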
\r\n\r\nThe `InMemoryExactNNIndex` can be used for prototyping and is suitable for small-scale datasets (1k-10k documents), as opposed to a vector database that is suitable for larger scales but comes with a performance overhead at smaller scales.\r\n\r\n```python\r\nfrom docarray import BaseDoc, DocList\r\nfrom docarray.index.backends.in_memory import InMemoryExactNNIndex\r\nfrom docarray.typing import NdArray\r\n\r\nimport numpy as np\r\n\r\nclass MyDoc(BaseDoc):\r\n    tensor: NdArray[512]\r\n\r\ndocs = DocList[MyDoc](MyDoc(tensor=i*np.ones(512)) for i in range(10))\r\n\r\ndoc_index = InMemoryExactNNIndex[MyDoc]()\r\ndoc_index.index(docs)\r\n\r\nprint(doc_index.find(3*np.ones(512), search_field='tensor', top_k=3))\r\n```\r\n\r\n```python\r\nFindResult(documents=\u003CDocList[MyDoc] (length=10)>, scores=array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))\r\n```\r\n\r\n### `DocList` inherits from Python `list` (#1457)\r\n\r\n`DocList` is now a subclass of Python's `list`. This means that you can now use all the methods that are available to Python lists on `DocList` objects. For example, you can now use `len` on `DocList` objects, and tools like Pydantic or FastAPI will be able to work with it more easily.\r\n\r\n### Add `len` to `DocIndex` (#1454)\r\n\r\nYou can now perform `len(vector_index)`, which is equivalent to `vector_index.num_docs()`.\r\n\r\n### Other minor features\r\n\r\n- Add a `to_json` alias to `BaseDoc` (#1494)\r\n\r\n## 🐞 Bug Fixes\r\n\r\n### Point to older versions when importing `Document` or `DocumentArray` (#1422)\r\n\r\nTrying to load `Document` or `DocumentArray` from DocArray would previously raise an error, saying that you needed to downgrade your version of DocArray if you wanted to use these two objects. This behavior has been fixed.\r\n\r\n### Fix `AnyDoc.from_protobuf` (#1437)\r\n\r\n`AnyDoc` can now read any `BaseDoc` protobuf file. 
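The "`DocList` inherits from `list`" change described above means standard list behavior comes for free, and the `DocList[MyDoc]` parameterization seen throughout these notes is the `__class_getitem__` pattern. A plain-Python sketch of that pattern with hypothetical stand-in classes (not docarray's actual internals):

```python
class BaseDocSketch:
    """Stand-in for a document type (hypothetical, not docarray's BaseDoc)."""

    def __init__(self, **fields):
        self.__dict__.update(fields)


class DocListSketch(list):
    """A list subclass that remembers its element type, roughly like DocList[MyDoc]."""

    doc_type = None

    def __class_getitem__(cls, doc_type):
        # Build a parameterized subclass on the fly, recording the element type.
        name = f"{cls.__name__}[{doc_type.__name__}]"
        return type(name, (cls,), {"doc_type": doc_type})

    def append(self, doc):
        # Enforce homogeneity for parameterized lists.
        if self.doc_type is not None and not isinstance(doc, self.doc_type):
            raise TypeError(f"expected {self.doc_type.__name__}")
        super().append(doc)


class MyDoc(BaseDocSketch):
    pass


docs = DocListSketch[MyDoc]()
docs.append(MyDoc(text="hello"))
assert isinstance(docs, list)  # plain list behavior for free
```

Because the result really is a `list`, `len`, iteration, and slicing all work without extra code, which is the interoperability benefit the release note describes.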
The same applies","2023-05-08T09:41:40",{"id":251,"version":252,"summary_zh":253,"released_at":254},280167,"v0.30.0","# 💫 Release v0.30.0 (a.k.a DocArray v2)\r\n\r\n> **Warning**\r\n> This version of DocArray is a complete rewrite, therefore it includes several (more than breaking) changes. Be sure to check the [documentation](https:\u002F\u002Fdocs.docarray.org\u002F) to prepare your migration.\r\n\r\n## Changelog\r\n\r\nIf you are using DocArray v\u003C0.30.0, you will be familiar with its [dataclass API](https:\u002F\u002Fdocarray.jina.ai\u002Ffundamentals\u002Fdataclass\u002F).\r\n\r\nDocArray v2 is that idea, taken seriously. Every document is created through a dataclass-like interface, courtesy of [Pydantic](https:\u002F\u002Fpydantic-docs.helpmanual.io\u002Fusage\u002Fmodels\u002F).\r\n\r\nThis gives the following advantages:\r\n\r\n- **Flexibility:** No need to conform to a fixed set of fields -- your data defines the schema.\r\n- **Multimodality:** Easily store multiple modalities and multiple embeddings in the same Document.\r\n- **Language agnostic:** At their core, Documents are just dictionaries. This makes it easy to create and send them from any language, not just Python.\r\n\r\nYou may also be familiar with our old Document Stores for vector database integration. 
They are now called **Document Indexes** and offer the following improvements:\r\n\r\n- **Hybrid search:** You can now combine vector search with text search, and even filter by arbitrary fields.\r\n- **Production-ready:** The new Document Indexes are a much thinner wrapper around the various vector DB libraries, making them more robust and easier to maintain.\r\n- **Increased flexibility:** We strive to support any configuration or setting that you could perform through the DB's first-party client.\r\n\r\nFor now, Document Indexes support **[Weaviate](https:\u002F\u002Fweaviate.io\u002F)**, **[Qdrant](https:\u002F\u002Fqdrant.tech\u002F)**, **[ElasticSearch](https:\u002F\u002Fwww.elastic.co\u002F)**, and **[HNSWLib](https:\u002F\u002Fgithub.com\u002Fnmslib\u002Fhnswlib)**, with more to come.\r\n\r\n## Changes to `Document`\r\n\r\n- `Document` has been renamed to [`BaseDoc`](https:\u002F\u002Fdocs.docarray.org\u002FAPI_reference\u002Fbase_doc\u002Fbase_doc\u002F).\r\n- `BaseDoc` cannot be used directly, but instead has to be extended. Therefore, each document class is created through a dataclass-like interface.\r\n- Following from the previous point, extending `BaseDoc` allows for a flexible schema compared to the `Document` class in v1 which only allowed for a fixed schema, with one of `tensor`, `text` and `blob`, and additional `chunks` and `matches`.\r\n- Due to the added flexibility, one can not know what fields your document class will provide. Therefore, various methods from v1 (such as `.load_uri_to_image_tensor()`) are not supported in v2. Instead, we provide some of those methods on the [typing-level](data_types\u002Ffirst_steps.md). \r\n- In v2 we have the [`LegacyDocument`](https:\u002F\u002Fdocs.docarray.org\u002FAPI_reference\u002Fdocuments\u002Fdocuments\u002F#docarray.documents.legacy.LegacyDocument) class, which extends `BaseDoc` while following the same schema as v1's `Document`. 
The `LegacyDocument` can be useful to start migrating your codebase from v1 to v2. Nevertheless, the API is not fully compatible with DocArray v1 `Document`. Indeed, none of the methods associated with `Document` are present. Only the schema of the data is similar.\r\n\r\n## Changes to `DocumentArray`\r\n\r\n### DocList\r\n\r\n- The `DocumentArray` class from v1 has been renamed to [`DocList`](https:\u002F\u002Fdocs.docarray.org\u002FAPI_reference\u002Farray\u002Fda\u002F), to be more descriptive of its actual functionality, since it is a list of `BaseDoc`s.\r\n\r\n### DocVec\r\n\r\n- Additionally, we introduced the class [`DocVec`](https:\u002F\u002Fdocs.docarray.org\u002FAPI_reference\u002Farray\u002Fda_stack\u002F), which is a column-based representation of `BaseDoc`s. Both `DocVec` and `DocList` extend `AnyDocArray`.\r\n- `DocVec` is a container of Documents appropriate for computations that require batches of data (e.g. matrix multiplication, distance calculation, deep learning forward passes).\r\n- A `DocVec` has a similar interface to `DocList`, but with an underlying implementation that is column-based instead of row-based. Each field of the schema of the `DocVec` (the `.doc_type`, which is a `BaseDoc`) will be stored in a column. If the field is a tensor, the data from all Documents will be stored as a single `doc_vec` (Torch\u002FTensorFlow\u002FNumPy) tensor. If the tensor field is `AnyTensor` or a Union of tensor types, the `.tensor_type` will be used to determine the type of the `doc_vec` column. \r\n\r\n### Parameterized DocList\r\n\r\n- With the added flexibility of your document schema, a `DocList` does not necessarily have to be homogeneous. 
\r\n- If you want a homogenous `DocList` you can parameterize it at initialization time:\r\n\r\n```python\r\nfrom docarray import DocList\r\nfrom docarray.documents import ImageDoc\r\n\r\ndocs = DocList[ImageDoc]()\r\n```\r\n\r\n- Methods like `.from_csv()` or `.pull()` only work with parameterized `DocList`s. \r\n\r\n### Access attributes of your DocumentArray\r\n\r\n- In v1 you could access an attribute of all Documents in your DocumentArray by calling the plural of the attribute's name on your DocArray","2023-04-18T07:59:14",{"id":256,"version":257,"summary_zh":258,"released_at":259},280168,"v0.21.0","## Release Note (`0.21.0`)\r\n\r\n> Release time: 2023-01-17 09:10:50\r\n\r\nThis release contains 3 new features, 7 bug fixes and 5 documentation improvements.\r\n\r\n## 🆕 Features\r\n\r\n### OpenSearch Document Store (#853)\r\n\r\nThis version of DocArray adds a **new Document Store**: [OpenSearch](https:\u002F\u002Fopensearch.org\u002F)!\r\n\r\nYou can use the OpenSearch Document Store to index your Documents and perform **ANN search** on them:\r\n\r\n```python\r\nfrom docarray import Document, DocumentArray\r\nimport numpy as np\r\n\r\n# Connect to OpenSearch instance\r\nn_dim = 3\r\n\r\nda = DocumentArray(\r\n    storage='opensearch',\r\n    config={'n_dim': n_dim},\r\n)\r\n\r\n# Index Documents\r\nwith da:\r\n    da.extend(\r\n        [\r\n            Document(id=f'r{i}', embedding=i * np.ones(n_dim))\r\n            for i in range(10)\r\n        ]\r\n    )\r\n\r\n# Perform ANN search\r\nnp_query = np.ones(n_dim) * 8\r\nresults = da.find(np_query, limit=10)\r\n```\r\n\r\nAdditionally, the OpenSearch Document Store can perform **filter queries, search by text**, and **search by tags**.\r\n\r\nLearn more about its usage in the [official documentation](https:\u002F\u002Fdocarray.jina.ai\u002Fadvanced\u002Fdocument-store\u002Fopensearch\u002F).\r\n\r\n\r\n### Add color to point cloud display (#961)\r\n\r\nYou can now **include color information in your point cloud 
data**, which can be visualized using `display_point_cloud_tensor()`:\r\n\r\n```python\r\ncoords = np.load('a_red_motorbike\u002Fcoords.npy')\r\ncolors = np.load('a_red_motorbike\u002Fcoord_colors.npy')\r\n\r\ndoc = Document(\r\n    tensor=coords,\r\n    chunks=DocumentArray([Document(tensor=colors, name='point_cloud_colors')])\r\n)\r\ndoc.display()\r\n```\r\n![image](https:\u002F\u002Fuser-images.githubusercontent.com\u002F44071807\u002F212716603-4031d721-7612-41a6-9e52-768de31212d6.png)\r\n\r\n\r\n### Add language attribute to Redis Document Store (#953)\r\n\r\nThe Redis Document Store now supports **text search in** [various supported languages](https:\u002F\u002Fredis.io\u002Fdocs\u002Fstack\u002Fsearch\u002Freference\u002Fstemming\u002F). To set a desired language, change the `language` parameter in the Redis configuration:\r\n\r\n```python\r\nda = DocumentArray(\r\n    storage='redis',\r\n    config={\r\n        'n_dim': 128,\r\n        'index_text': True,\r\n        'language': 'chinese',\r\n    },\r\n)\r\n```\r\n\r\n## 🐞 Bug Fixes\r\n\r\n### Replace newline with whitespace to fix display in plot embeddings (#963)\r\nWhenever the string `\"\\n\"` was contained in any Document field, `doc.plot()` would result in a rendering error. This fixes those errors by rendering `\"\\n\"` as whitespace.\r\n### Fix unwanted coercion in `to_pydantic_model` (#949)\r\nThis bug caused all strings of the form `'Infinity'` to be coerced to the string `'inf'` when calling `to_pydantic_model()` or `to_dict()`. This is fixed now, leaving such strings unchanged.\r\n### Calculate relevant docs on index instead of queries (#950)\r\nIn the `embed_and_evaluate()` method, the number of relevant Documents per label used to be calculated based on the Document in `self`. 
This is not generally correct, so after this fix the quantity is calculated based on the Documents in the index data.\r\n### Remove offset index create on list like false (#936)\r\nWhen a Document Store has list-like behavior disabled, it no longer creates an offset-to-id mapping, which improves performance.\r\n### Add support for remote audio files (#933)\r\nLoading audio files from a remote URL would cause `FileNotFoundError`, which is now fixed.\r\n### Query operator `$exists` does not work correctly with tags (#911) (#923)\r\nBefore this fix, `$exists` would treat false-y values such as `0` or `[]` as non-existent. This is now fixed.\r\n### Document from dataclass with singleton list (#1018)\r\nWhen casting from a dataclass to Document, singleton lists were treated like an individual element, even if the corresponding field was annotated with `List[...]`. Now this case is considered, and accessing such a field will yield a DocumentArray, even for singleton inputs.\r\n\r\n## 📗 Documentation Improvements\r\n\r\n- Link to Discord (#1010)\r\n- Have less versions to avoid deployment timeout (#977)\r\n- Fix data management section not appearing in Documentation (#967)\r\n- Link to OpenSearch docs in sidebar (#960)\r\n- Multimodal to datatypes (#934)\r\n\r\n## 🤟 Contributors\r\n\r\nWe would like to thank all contributors to this release:\r\n- Jay Bhambhani (@jay-bhambhani)\r\n- Alvin Prayuda (@alphinside)\r\n- Johannes Messner (@JohannesMessner)\r\n- samsja (@samsja)\r\n- Marco Luca Sbodio (@marcosbodio)\r\n- Anne Yang (@AnneYang720)\r\n- Michael Günther (@guenthermi)\r\n- AlaeddineAbdessalem (@alaeddine-13)\r\n- Han Xiao (@hanxiao)\r\n- Alex Cureton-Griffiths (@alexcg1)\r\n- Charlotte Gerhaher 
(@anna-charlotte)

# Release Note (`0.20.1`)

> Release time: 2022-12-12 09:32:37

## 🐞 Bug Fixes

### Make Milvus DocumentArray thread safe and suitable for pytest (#904)

This bug was causing connectivity issues when using _multiple DocumentArrays in different threads to connect to the same Milvus instance_, e.g. in pytest.

This would produce an error like the following:

```bash
E1207 14:59:51.357528591    2279 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers
E1207 14:59:51.367985469    2279 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers
E1207 14:59:51.457061884    3934 ev_epoll1_linux.cc:824]     assertion failed: gpr_atm_no_barrier_load(&g_active_poller) != (gpr_atm)worker
Fatal Python error: Aborted
```

This fix _creates a separate gRPC connection for each MilvusDocumentArray instance_, circumventing the issue.

### Restore backwards compatibility for (de)serialization (#903)

_DocArray v0.20.0 broke (de)serialization backwards compatibility with earlier versions_ of the library, making it impossible to load DocumentArrays from v0.19.1 or earlier from disk:

```python
# DocArray <= 0.19.1
da = DocumentArray([Document() for _ in range(10)])
da.save_binary('old-da.docarray')
# DocArray == 0.20.0
da = DocumentArray.load_binary('old-da.docarray')
da.extend([Document()])
print(da)
```

```bash
AttributeError: 'DocumentArrayInMemory' object has no attribute '_is_subindex'
```

_This fix restores backwards compatibility_ by not relying on newly introduced private attributes:

```python
# DocArray <= 0.19.1
da = DocumentArray([Document() for _ in range(10)])
da.save_binary('old-da.docarray')
# DocArray == 0.20.1
da = DocumentArray.load_binary('old-da.docarray')
da.extend([Document()])
print(da)
```

```bash
<DocumentArray (length=11) at 140683902276416>

Process finished with exit code 0
```

## 📗 Documentation Improvements

- Polish docs throughout (#895)

## 🤟 Contributors

We would like to thank all contributors to this release:
- Anne Yang (@AnneYang720)
- Nan Wang (@nan-wang)
- anna-charlotte (@anna-charlotte)
- Alex Cureton-Griffiths (@alexcg1)

# Release Note (`0.20.0`)

> Release time: 2022-12-07 12:15:30

This release contains 8 new features, 3 bug fixes and 7 documentation improvements.

## 🆕 Features

### Milvus document store (#587)

This release supports the [Milvus](https://github.com/milvus-io/milvus) vector database as a document store.

```python
da = DocumentArray(storage='milvus', config={'n_dim': 3})
```

### `root_id` for document stores (#808)

When working with a vector database you can now retrieve the root document even if you search at a nested level with sub-indices (for example at chunk level).

```python
top_level_matches = da.find(query=np.random.rand(512), on='@.[image]', return_root=True)
```

To allow this we now store the `root_id` in the chunks' tags.
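Conceptually, this amounts to a chunk-to-root lookup: each chunk carries its root document's id in its tags, so a match found at chunk level can be resolved back to the top-level document. A library-free sketch of the idea (all ids, field names, and the `root_id` tag key here are purely illustrative):

```python
# Conceptual sketch of the root_id mechanism (no docarray required):
# top-level documents and their chunk-level entries.
root_docs = {
    'doc-1': {'text': 'a red motorbike', 'chunks': ['chunk-a', 'chunk-b']},
    'doc-2': {'text': 'a blue bicycle', 'chunks': ['chunk-c']},
}

# Each chunk stores its root document's id in its tags, as described above.
chunks = {
    'chunk-a': {'tags': {'root_id': 'doc-1'}},
    'chunk-b': {'tags': {'root_id': 'doc-1'}},
    'chunk-c': {'tags': {'root_id': 'doc-2'}},
}

def resolve_root(chunk_id: str) -> dict:
    """Map a chunk-level match back to its root document via the stored id."""
    root_id = chunks[chunk_id]['tags']['root_id']
    return root_docs[root_id]

print(resolve_root('chunk-c')['text'])  # prints: a blue bicycle
```

With `return_root=True`, the library performs this resolution for you, so `find()` returns top-level matches instead of the matched chunks themselves.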
You can enable this by passing `root_id=True` in your document store configuration.

### Filtering based on text keywords for Qdrant (#849)

You can now filter based on text keywords for the Qdrant document store.

```python
filter = {
    'must': [
        {"key": "info", "match": {"text": "shoes"}}
    ]
}

results = da.find(np.random.rand(n_dim), filter=filter)
```

### RGB-D representation of 3D meshes (#753)

DocArray already supports [3D mesh representation in different formats](https://docarray.jina.ai/datatypes/mesh/) and this release adds support for RGB-D representation.

```python
doc.load_uris_to_rgbd_tensor()
```

### Load multi-page tiff files into chunks (#845)

Multi-page `tiff` images can now be loaded with `load_uri_to_image_tensor()`.

```python
d = Document(uri="foo.tiff")
d.load_uri_to_image_tensor()
print(d)
```

```
<Document ('id', 'uri', 'chunks') at 7f907d786d6c11ec840a1e008a366d49>
  └─ chunks
     ├─ <Document ('id', 'parent_id', 'granularity', 'tensor') at 7aa4c0ba66cf6c300b7f07fdcbc2fdc8>
     ├─ <Document ('id', 'parent_id', 'granularity', 'tensor') at bc94a3e3ca60352f2e4c9ab1b1bb9c22>
     └─ <Document ('id', 'parent_id', 'granularity', 'tensor') at 36fe0d1daf4442ad6461c619f8bb25b7>
```

### Store key frame indices when loading video tensor from uri (#880)

`keyframe_indices` are now stored in a Document's tags when [loading a video to tensor](https://docarray.jina.ai/datatypes/video/#load-video-data). This allows extracting the section of the video between key frames.

```python
d = Document(uri="video.mp4").load_uri_to_video_tensor()
print(d.tags['keyframe_indices'])
```

```
[0, 25, 196, ...]
```

### Better plotting of embeddings for nested and complex data (#891)

You can now choose which meta fields to exclude when calling DocumentArray's `plot_embeddings()` method. This makes it easier to plot embeddings for complex and nested data.

```python
docs.plot_embeddings(exclude_fields_metas=['chunks'])
```

### Better support for information retrieval evaluation (#826)

This release adds a `max_rel_per_label` parameter to better support metric calculations that require the number of relevant Documents.

```python
metrics = da.evaluate(['recall_at_k'], max_rel_per_label={i: 1 for i in range(3)})
```

## 🐞 Bug Fixes

### Support length calculation independently from list-like behavior (#840)

[DocArray 0.19](https://github.com/docarray/docarray/releases/tag/v0.19.0) added the ability to instantiate a document store without list-like behavior for improved performance. However, calculating the length of certain document stores relied on such list-like behavior. This release fixes length calculation for the Redis document store, making it independent from list-like behavior.

### Remove cosine similarity field with false assignment (#835)

In the Weaviate document store, cosine distance is no longer mistakenly assigned to the `cosine_similarity` field.

### Rebuild index after clearing storage (#837)

The index for Redis and Elasticsearch document stores is now rebuilt when `_clear_storage` is called.

## 📗 Documentation Improvements

- Correct Document description (#842)
- Minor correction in Document description (#834)
- Add username to DocArray pull (#847)
- Fix broken docs (#805)
- Fix data management section (#801)
- Change logic order according to blog (#797)
- Move cloud support to integrations (#798)

## 🤟 Contributors

We would like to thank all contributors to this release:
- Delgermurun (@delgermurun)
- Anne Yang (@AnneYang720)
- anna-charlotte (@anna-charlotte)
- Johannes Messner (@JohannesMessner)
- Alex Cureton-Griffiths (@alexcg1)
- AlaeddineAbdessalem (@alaeddine-13)
- dong xiang (@dongxiang123)
- coolmian (@coolmian)
- Joan Fontanals (@JoanFM)
- Nan Wang (@nan-wang)
- samsja (@samsja)
- Michael Günther (@guenthermi)