[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-spotify--annoy":3,"tool-spotify--annoy":65},[4,23,32,40,49,57],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":22},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85267,2,"2026-04-18T11:00:28",[13,14,15,16,17,18,19,20,21],"图像","数据工具","视频","插件","Agent","其他","语言模型","开发框架","音频","ready",{"id":24,"name":25,"github_repo":26,"description_zh":27,"stars":28,"difficulty_score":29,"last_commit_at":30,"category_tags":31,"status":22},5784,"funNLP","fighting41love\u002FfunNLP","funNLP 是一个专为中文自然语言处理（NLP）打造的超级资源库，被誉为\"NLP 民工的乐园”。它并非单一的软件工具，而是一个汇集了海量开源项目、数据集、预训练模型和实用代码的综合性平台。\n\n面对中文 NLP 领域资源分散、入门门槛高以及特定场景数据匮乏的痛点，funNLP 提供了“一站式”解决方案。这里不仅涵盖了分词、命名实体识别、情感分析、文本摘要等基础任务的标准工具，还独特地收录了丰富的垂直领域资源，如法律、医疗、金融行业的专用词库与数据集，甚至包含古诗词生成、歌词创作等趣味应用。其核心亮点在于极高的全面性与实用性，从基础的字典词典到前沿的 BERT、GPT-2 模型代码，再到高质量的标注数据和竞赛方案，应有尽有。\n\n无论是刚刚踏入 NLP 领域的学生、需要快速验证想法的算法工程师，还是从事人工智能研究的学者，都能在这里找到急需的“武器弹药”。对于开发者而言，它能大幅减少寻找数据和复现模型的时间；对于研究者，它提供了丰富的基准测试资源和前沿技术参考。funNLP 以开放共享的精神，极大地降低了中文自然语言处理的开发与研究成本，是中文 AI 社区不可或缺的宝藏仓库。",79857,1,"2026-04-08T20:11:31",[19,14,18],{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":29,"last_commit_at":38,"category_tags":39,"status":22},5773,"cs-video-courses","Developer-Y\u002Fcs-video-courses","cs-video-courses 是一个精心整理的计算机科学视频课程清单，旨在为自学者提供系统化的学习路径。它汇集了全球知名高校（如加州大学伯克利分校、新南威尔士大学等）的完整课程录像，涵盖从编程基础、数据结构与算法，到操作系统、分布式系统、数据库等核心领域，并深入延伸至人工智能、机器学习、量子计算及区块链等前沿方向。\n\n面对网络上零散且质量参差不齐的教学资源，cs-video-courses 
解决了学习者难以找到成体系、高难度大学级别课程的痛点。该项目严格筛选内容，仅收录真正的大学层级课程，排除了碎片化的简短教程或商业广告，确保用户能接触到严谨的学术内容。\n\n这份清单特别适合希望夯实计算机基础的开发者、需要补充特定领域知识的研究人员，以及渴望像在校生一样系统学习计算机科学的自学者。其独特的技术亮点在于分类极其详尽，不仅包含传统的软件工程与网络安全，还细分了生成式 AI、大语言模型、计算生物学等新兴学科，并直接链接至官方视频播放列表，让用户能一站式获取高质量的教育资源，免费享受世界顶尖大学的课堂体验。",79792,"2026-04-08T22:03:59",[18,13,14,20],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":46,"last_commit_at":47,"category_tags":48,"status":22},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[17,13,20,19,18],{"id":50,"name":51,"github_repo":52,"description_zh":53,"stars":54,"difficulty_score":46,"last_commit_at":55,"category_tags":56,"status":22},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",75872,"2026-04-18T10:54:57",[19,13,20,18],{"id":58,"name":59,"github_repo":60,"description_zh":61,"stars":62,"difficulty_score":29,"last_commit_at":63,"category_tags":64,"status":22},3215,"awesome-machine-learning","josephmisiti\u002Fawesome-machine-learning","awesome-machine-learning 是一份精心整理的机器学习资源清单，汇集了全球优秀的机器学习框架、库和软件工具。面对机器学习领域技术迭代快、资源分散且难以甄选的痛点，这份清单按编程语言（如 Python、C++、Go 
等）和应用场景（如计算机视觉、自然语言处理、深度学习等）进行了系统化分类，帮助使用者快速定位高质量项目。\n\n它特别适合开发者、数据科学家及研究人员使用。无论是初学者寻找入门库，还是资深工程师对比不同语言的技术选型，都能从中获得极具价值的参考。此外，清单还延伸提供了免费书籍、在线课程、行业会议、技术博客及线下聚会等丰富资源，构建了从学习到实践的全链路支持体系。\n\n其独特亮点在于严格的维护标准：明确标记已停止维护或长期未更新的项目，确保推荐内容的时效性与可靠性。作为机器学习领域的“导航图”，awesome-machine-learning 以开源协作的方式持续更新，旨在降低技术探索门槛，让每一位从业者都能高效地站在巨人的肩膀上创新。",72149,"2026-04-03T21:50:24",[20,18],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":81,"owner_twitter":80,"owner_website":82,"owner_url":83,"languages":84,"stars":112,"forks":113,"last_commit_at":114,"license":115,"difficulty_score":29,"env_os":116,"env_gpu":117,"env_ram":118,"env_deps":119,"category_tags":123,"github_topics":124,"view_count":10,"oss_zip_url":80,"oss_zip_packed_at":80,"status":22,"created_at":132,"updated_at":133,"faqs":134,"releases":164},9187,"spotify\u002Fannoy","annoy","Approximate Nearest Neighbors in C++\u002FPython optimized for memory usage and loading\u002Fsaving to disk","Annoy 是一个由 Spotify 开源的高效近似最近邻搜索库，提供 C++ 核心与 Python 接口。它主要用于在海量高维向量数据中，快速查找与目标点最相似的邻居节点，广泛应用于推荐系统、图像检索等场景。\n\n面对传统算法在处理百万级数据时内存占用过高、多进程共享困难的问题，Annoy 给出了独特的解决方案。其核心亮点在于支持将索引构建为只读的静态文件，并通过内存映射（mmap）技术加载。这意味着索引只需构建一次，即可被多个进程同时共享读取，不仅极大降低了内存 footprint，还实现了毫秒级的启动速度。此外，Annoy 支持欧氏距离、余弦相似度等多种度量方式，即使在千维空间下也能保持优异的性能。\n\n这款工具非常适合后端开发者、算法工程师及数据科学家使用。如果你需要在生产环境中部署大规模相似性搜索服务，或者希望在 Hadoop 等分布式任务中高效复用索引文件，Annoy 凭借其轻量、快速且易于集成的特性，将成为你值得信赖的技术选择。","Annoy\n-----\n\n\n\n.. figure:: https:\u002F\u002Fraw.github.com\u002Fspotify\u002Fannoy\u002Fmaster\u002Fann.png\n   :alt: Annoy example\n   :align: center\n\n.. 
image:: https:\u002F\u002Fgithub.com\u002Fspotify\u002Fannoy\u002Factions\u002Fworkflows\u002Fci.yml\u002Fbadge.svg\n   :target: https:\u002F\u002Fgithub.com\u002Fspotify\u002Fannoy\u002Factions\n\nAnnoy (`Approximate Nearest Neighbors \u003Chttp:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FNearest_neighbor_search#Approximate_nearest_neighbor>`__ Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are `mmapped \u003Chttps:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMmap>`__ into memory so that many processes may share the same data.\n\nInstall\n-------\n\nTo install, simply do ``pip install --user annoy`` to pull down the latest version from `PyPI \u003Chttps:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fannoy>`_.\n\nFor the C++ version, just clone the repo and ``#include \"annoylib.h\"``.\n\nBackground\n----------\n\nThere are some other libraries to do nearest neighbor search. Annoy is almost as fast as the fastest libraries, (see below), but there is actually another feature that really sets Annoy apart: it has the ability to **use static files as indexes**. In particular, this means you can **share index across processes**. Annoy also decouples creating indexes from loading them, so you can pass around indexes as files and map them into memory quickly. Another nice thing of Annoy is that it tries to minimize memory footprint so the indexes are quite small.\n\nWhy is this useful? If you want to find nearest neighbors and you have many CPU's, you only need to build the index once. You can also pass around and distribute static files to use in production environment, in Hadoop jobs, etc. Any process will be able to load (mmap) the index into memory and will be able to do lookups immediately.\n\nWe use it at `Spotify \u003Chttp:\u002F\u002Fwww.spotify.com\u002F>`__ for music recommendations. 
After running matrix factorization algorithms, every user\u002Fitem can be represented as a vector in f-dimensional space. This library helps us search for similar users\u002Fitems. We have many millions of tracks in a high-dimensional space, so memory usage is a prime concern.\n\nAnnoy was built by `Erik Bernhardsson \u003Chttp:\u002F\u002Fwww.erikbern.com>`__ in a couple of afternoons during `Hack Week \u003Chttp:\u002F\u002Flabs.spotify.com\u002F2013\u002F02\u002F15\u002Forganizing-a-hack-week\u002F>`__.\n\nSummary of features\n-------------------\n\n* `Euclidean distance \u003Chttps:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FEuclidean_distance>`__, `Manhattan distance \u003Chttps:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FTaxicab_geometry>`__, `cosine distance \u003Chttps:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FCosine_similarity>`__, `Hamming distance \u003Chttps:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHamming_distance>`__, or `Dot (Inner) Product distance \u003Chttps:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDot_product>`__\n* Cosine distance is equivalent to Euclidean distance of normalized vectors = sqrt(2-2*cos(u, v))\n* Works better if you don't have too many dimensions (like \u003C100) but seems to perform surprisingly well even up to 1,000 dimensions\n* Small memory usage\n* Lets you share memory between multiple processes\n* Index creation is separate from lookup (in particular you can not add more items once the tree has been created)\n* Native Python support, tested with 2.7, 3.6, and 3.7.\n* Build index on disk to enable indexing big datasets that won't fit into memory (contributed by `Rene Hollander \u003Chttps:\u002F\u002Fgithub.com\u002FReneHollander>`__)\n\nPython code example\n-------------------\n\n.. 
code-block:: python\n\n  from annoy import AnnoyIndex\n  import random\n\n  f = 40  # Length of item vector that will be indexed\n\n  t = AnnoyIndex(f, 'angular')\n  for i in range(1000):\n      v = [random.gauss(0, 1) for z in range(f)]\n      t.add_item(i, v)\n\n  t.build(10) # 10 trees\n  t.save('test.ann')\n\n  # ...\n\n  u = AnnoyIndex(f, 'angular')\n  u.load('test.ann') # super fast, will just mmap the file\n  print(u.get_nns_by_item(0, 1000)) # will find the 1000 nearest neighbors\n\nRight now it only accepts integers as identifiers for items. Note that it will allocate memory for max(id)+1 items because it assumes your items are numbered 0 … n-1. If you need other id's, you will have to keep track of a map yourself.\n\nFull Python API\n---------------\n\n* ``AnnoyIndex(f, metric)`` returns a new index that's read-write and stores vector of ``f`` dimensions. Metric can be ``\"angular\"``, ``\"euclidean\"``, ``\"manhattan\"``, ``\"hamming\"``, or ``\"dot\"``.\n* ``a.add_item(i, v)`` adds item ``i`` (any nonnegative integer) with vector ``v``. Note that it will allocate memory for ``max(i)+1`` items.\n* ``a.build(n_trees, n_jobs=-1)`` builds a forest of ``n_trees`` trees. More trees gives higher precision when querying. After calling ``build``, no more items can be added. ``n_jobs`` specifies the number of threads used to build the trees. ``n_jobs=-1`` uses all available CPU cores.\n* ``a.save(fn, prefault=False)`` saves the index to disk and loads it (see next function). After saving, no more items can be added.\n* ``a.load(fn, prefault=False)`` loads (mmaps) an index from disk. If `prefault` is set to `True`, it will pre-read the entire file into memory (using mmap with `MAP_POPULATE`). Default is `False`.\n* ``a.unload()`` unloads.\n* ``a.get_nns_by_item(i, n, search_k=-1, include_distances=False)`` returns the ``n`` closest items. During the query it will inspect up to ``search_k`` nodes which defaults to ``n_trees * n`` if not provided. 
``search_k`` gives you a run-time tradeoff between better accuracy and speed. If you set ``include_distances`` to ``True``, it will return a 2 element tuple with two lists in it: the second one containing all corresponding distances.\n* ``a.get_nns_by_vector(v, n, search_k=-1, include_distances=False)`` same but query by vector ``v``.\n* ``a.get_item_vector(i)`` returns the vector for item ``i`` that was previously added.\n* ``a.get_distance(i, j)`` returns the distance between items ``i`` and ``j``. NOTE: this used to return the *squared* distance, but has been changed as of Aug 2016.\n* ``a.get_n_items()`` returns the number of items in the index.\n* ``a.get_n_trees()`` returns the number of trees in the index.\n* ``a.on_disk_build(fn)`` prepares annoy to build the index in the specified file instead of RAM (execute before adding items, no need to save after build)\n* ``a.set_seed(seed)`` will initialize the random number generator with the given seed.  Only used for building up the tree, i. e. only necessary to pass this before adding the items.  Will have no effect after calling `a.build(n_trees)` or `a.load(fn)`.\n\nNotes:\n\n* There's no bounds checking performed on the values so be careful.\n* Annoy uses Euclidean distance of normalized vectors for its angular distance, which for two vectors u,v is equal to ``sqrt(2(1-cos(u,v)))``\n\n\nThe C++ API is very similar: just ``#include \"annoylib.h\"`` to get access to it.\n\nTradeoffs\n---------\n\nThere are just two main parameters needed to tune Annoy: the number of trees ``n_trees`` and the number of nodes to inspect during searching ``search_k``.\n\n* ``n_trees`` is provided during build time and affects the build time and the index size. A larger value will give more accurate results, but larger indexes.\n* ``search_k`` is provided in runtime and affects the search performance. 
A larger value will give more accurate results, but will take longer time to return.\n\nIf ``search_k`` is not provided, it will default to ``n * n_trees`` where ``n`` is the number of approximate nearest neighbors. Otherwise, ``search_k`` and ``n_trees`` are roughly independent, i.e. the value of ``n_trees`` will not affect search time if ``search_k`` is held constant and vice versa. Basically it's recommended to set ``n_trees`` as large as possible given the amount of memory you can afford, and it's recommended to set ``search_k`` as large as possible given the time constraints you have for the queries.\n\nYou can also accept slower search times in favour of reduced loading times, memory usage, and disk IO. On supported platforms the index is prefaulted during ``load`` and ``save``, causing the file to be pre-emptively read from disk into memory. If you set ``prefault`` to ``False``, pages of the mmapped index are instead read from disk and cached in memory on-demand, as necessary for a search to complete. This can significantly increase early search times but may be better suited for systems with low memory compared to index size, when few queries are executed against a loaded index, and\u002For when large areas of the index are unlikely to be relevant to search queries.\n\n\nHow does it work\n----------------\n\nUsing `random projections \u003Chttp:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLocality-sensitive_hashing#Random_projection>`__ and by building up a tree. At every intermediate node in the tree, a random hyperplane is chosen, which divides the space into two subspaces. This hyperplane is chosen by sampling two points from the subset and taking the hyperplane equidistant from them.\n\nWe do this k times so that we get a forest of trees. 
k has to be tuned to your need, by looking at what tradeoff you have between precision and performance.\n\nHamming distance (contributed by `Martin Aumüller \u003Chttps:\u002F\u002Fgithub.com\u002Fmaumueller>`__) packs the data into 64-bit integers under the hood and uses built-in bit count primitives so it could be quite fast. All splits are axis-aligned.\n\nDot Product distance (contributed by `Peter Sobot \u003Chttps:\u002F\u002Fgithub.com\u002Fpsobot>`__ and `Pavel Korobov \u003Chttps:\u002F\u002Fgithub.com\u002Fpkorobov>`__) reduces the provided vectors from dot (or \"inner-product\") space to a more query-friendly cosine space using `a method by Bachrach et al., at Microsoft Research, published in 2014 \u003Chttps:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fwp-content\u002Fuploads\u002F2016\u002F02\u002FXboxInnerProduct.pdf>`__.\n\n\n\nMore info\n---------\n\n* `Dirk Eddelbuettel \u003Chttps:\u002F\u002Fgithub.com\u002Feddelbuettel>`__ provides an `R version of Annoy \u003Chttp:\u002F\u002Fdirk.eddelbuettel.com\u002Fcode\u002Frcpp.annoy.html>`__.\n* `Andy Sloane \u003Chttps:\u002F\u002Fgithub.com\u002Fa1k0n>`__ provides a `Java version of Annoy \u003Chttps:\u002F\u002Fgithub.com\u002Fspotify\u002Fannoy-java>`__ although currently limited to cosine and read-only.\n* `Pishen Tsai \u003Chttps:\u002F\u002Fgithub.com\u002Fpishen>`__ provides a `Scala wrapper of Annoy \u003Chttps:\u002F\u002Fgithub.com\u002Fpishen\u002Fannoy4s>`__ which uses JNA to call the C++ library of Annoy.\n* `Atsushi Tatsuma \u003Chttps:\u002F\u002Fgithub.com\u002Fyoshoku>`__ provides `Ruby bindings for Annoy \u003Chttps:\u002F\u002Fgithub.com\u002Fyoshoku\u002Fannoy.rb>`__.\n* There is `experimental support for Go \u003Chttps:\u002F\u002Fgithub.com\u002Fspotify\u002Fannoy\u002Fblob\u002Fmaster\u002FREADME_GO.rst>`__ provided by `Taneli Leppä \u003Chttps:\u002F\u002Fgithub.com\u002Frosmo>`__.\n* `Boris Nagaev \u003Chttps:\u002F\u002Fgithub.com\u002Fstarius>`__ wrote `Lua 
bindings \u003Chttps:\u002F\u002Fgithub.com\u002Fspotify\u002Fannoy\u002Fblob\u002Fmaster\u002FREADME_Lua.md>`__.\n* During part of Spotify Hack Week 2016 (and a bit afterward), `Jim Kang \u003Chttps:\u002F\u002Fgithub.com\u002Fjimkang>`__ wrote `Node bindings \u003Chttps:\u002F\u002Fgithub.com\u002Fjimkang\u002Fannoy-node>`__ for Annoy.\n* `Min-Seok Kim \u003Chttps:\u002F\u002Fgithub.com\u002Fmskimm>`__ built a `Scala version \u003Chttps:\u002F\u002Fgithub.com\u002Fmskimm\u002Fann4s>`__ of Annoy.\n* `hanabi1224 \u003Chttps:\u002F\u002Fgithub.com\u002Fhanabi1224>`__ built a read-only `Rust version \u003Chttps:\u002F\u002Fgithub.com\u002Fhanabi1224\u002FRuAnnoy>`__ of Annoy, together with **dotnet, jvm and dart** read-only bindings.\n* `Presentation from New York Machine Learning meetup \u003Chttp:\u002F\u002Fwww.slideshare.net\u002Ferikbern\u002Fapproximate-nearest-neighbor-methods-and-vector-models-nyc-ml-meetup>`__ about Annoy\n* Annoy is available as a `conda package \u003Chttps:\u002F\u002Fanaconda.org\u002Fconda-forge\u002Fpython-annoy>`__ on Linux, OS X, and Windows.\n* `ann-benchmarks \u003Chttps:\u002F\u002Fgithub.com\u002Ferikbern\u002Fann-benchmarks>`__ is a benchmark for several approximate nearest neighbor libraries. Annoy seems to be fairly competitive, especially at higher precisions:\n\n.. figure:: https:\u002F\u002Fraw.githubusercontent.com\u002Ferikbern\u002Fann-benchmarks\u002Fmain\u002Fresults\u002Fglove-100-angular.png\n   :alt: ANN benchmarks\n   :align: center\n   :target: https:\u002F\u002Fgithub.com\u002Ferikbern\u002Fann-benchmarks\n\nSource code\n-----------\n\nIt's all written in C++ with a handful of ugly optimizations for performance and memory usage. You have been warned :)\n\nThe code should support Windows, thanks to `Qiang Kou \u003Chttps:\u002F\u002Fgithub.com\u002Fthirdwing>`__ and `Timothy Riley \u003Chttps:\u002F\u002Fgithub.com\u002Ftjrileywisc>`__.\n\nTo run the tests, execute `python setup.py nosetests`. 
The test suite includes a big real world dataset that is downloaded from the internet, so it will take a few minutes to execute.\n\nDiscuss\n-------\n\nFeel free to post any questions or comments to the `annoy-user \u003Chttps:\u002F\u002Fgroups.google.com\u002Fgroup\u002Fannoy-user>`__ group. I'm `@fulhack \u003Chttps:\u002F\u002Ftwitter.com\u002Ffulhack>`__ on Twitter.\n","Annoy\n-----\n\n\n\n.. figure:: https:\u002F\u002Fraw.github.com\u002Fspotify\u002Fannoy\u002Fmaster\u002Fann.png\n   :alt: Annoy 示例\n   :align: center\n\n.. image:: https:\u002F\u002Fgithub.com\u002Fspotify\u002Fannoy\u002Factions\u002Fworkflows\u002Fci.yml\u002Fbadge.svg\n   :target: https:\u002F\u002Fgithub.com\u002Fspotify\u002Fannoy\u002Factions\n\nAnnoy（近似最近邻 Oh Yeah）是一个 C++ 库，并带有 Python 绑定，用于在空间中搜索与给定查询点距离较近的点。它还可以创建大型只读的基于文件的数据结构，这些数据结构会被 `mmap \u003Chttps:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMmap>`__ 映射到内存中，从而允许多个进程共享同一份数据。\n\n安装\n-------\n\n要安装，只需运行 ``pip install --user annoy``，即可从 `PyPI \u003Chttps:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fannoy>`_ 下载最新版本。\n\n对于 C++ 版本，只需克隆仓库并 ``#include \"annoylib.h\"`` 即可。\n\n背景\n----------\n\n目前已有其他一些库可以进行最近邻搜索。Annoy 的速度几乎与最快的库相当（见下文），但真正让 Annoy 脱颖而出的是它的另一项特性：它能够 **使用静态文件作为索引**。具体来说，这意味着你可以 **在多个进程之间共享索引**。Annoy 还将索引的创建与加载分离，因此你可以将索引以文件形式传递，并快速将其映射到内存中。Annoy 的另一个优点是它会尽量减少内存占用，因此生成的索引文件通常非常小。\n\n这有什么用呢？如果你需要查找最近邻，并且拥有多个 CPU 核心，那么你只需要构建一次索引即可。此外，你还可以将静态索引文件分发到生产环境、Hadoop 作业等场景中使用。任何进程都可以将索引文件通过 `mmap` 映射到内存中，然后立即进行查询。\n\n我们在 `Spotify \u003Chttp:\u002F\u002Fwww.spotify.com\u002F>`__ 将其用于音乐推荐。在运行矩阵分解算法后，每个用户或物品都可以表示为 f 维空间中的一个向量。这个库帮助我们搜索相似的用户或物品。我们的高维空间中包含数百万首曲目，因此内存使用是一个首要考虑的问题。\n\nAnnoy 是由 `Erik Bernhardsson \u003Chttp:\u002F\u002Fwww.erikbern.com>`__ 在一次 `Hack Week \u003Chttp:\u002F\u002Flabs.spotify.com\u002F2013\u002F02\u002F15\u002Forganizing-a-hack-week\u002F>`__ 的几个下午时间里开发出来的。\n\n功能概览\n-------------------\n\n* `欧几里得距离 \u003Chttps:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FEuclidean_distance>`__、`曼哈顿距离 
\u003Chttps:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FTaxicab_geometry>`__、`余弦距离 \u003Chttps:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FCosine_similarity>`__、`汉明距离 \u003Chttps:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHamming_distance>`__ 或者 `点积距离 \u003Chttps:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDot_product>`__\n* 余弦距离等价于归一化向量之间的欧几里得距离 = sqrt(2-2*cos(u, v))\n* 如果维度不太高（比如少于 100 维）效果更好，但即使高达 1,000 维也能表现出乎意料地好\n* 内存占用小\n* 允许多个进程共享内存\n* 索引创建与查询分离（一旦树结构建立，就无法再添加新条目）\n* 原生支持 Python，已在 2.7、3.6 和 3.7 上测试通过。\n* 可以在磁盘上构建索引，从而支持无法完全放入内存的大规模数据集索引（由 `Rene Hollander \u003Chttps:\u002F\u002Fgithub.com\u002FReneHollander>`__ 贡献）\n\nPython 代码示例\n-------------------\n\n.. code-block:: python\n\n  from annoy import AnnoyIndex\n  import random\n\n  f = 40  # 将被索引的向量长度\n\n  t = AnnoyIndex(f, 'angular')\n  for i in range(1000):\n      v = [random.gauss(0, 1) for z in range(f)]\n      t.add_item(i, v)\n\n  t.build(10) # 10 棵树\n  t.save('test.ann')\n\n  # ...\n\n  u = AnnoyIndex(f, 'angular')\n  u.load('test.ann') # 非常快速，直接通过 mmap 映射文件\n  print(u.get_nns_by_item(0, 1000)) # 找到最接近的 1000 个邻居\n\n目前它只接受整数作为条目的标识符。请注意，它会为 max(id)+1 个条目分配内存，因为它假设你的条目编号是 0 … n-1。如果你需要其他类型的 ID，就需要自己维护一个映射表。\n\n完整的 Python API\n---------------\n\n* ``AnnoyIndex(f, metric)`` 返回一个新的读写索引，存储 f 维的向量。度量方式可以是 ``\"angular\"``、``\"euclidean\"``、``\"manhattan\"``、``\"hamming\"`` 或 ``\"dot\"``。\n* ``a.add_item(i, v)`` 添加条目 ``i``（任意非负整数）及其对应的向量 ``v``。注意，它会为 ``max(i)+1`` 个条目分配内存。\n* ``a.build(n_trees, n_jobs=-1)`` 构建包含 ``n_trees`` 棵树的森林。树的数量越多，查询精度越高。调用 ``build`` 后，将不能再添加新的条目。``n_jobs`` 指定用于构建树的线程数量。``n_jobs=-1`` 表示使用所有可用的 CPU 核心。\n* ``a.save(fn, prefault=False)`` 将索引保存到磁盘，并加载它（参见下一个函数）。保存后，将不能再添加新的条目。\n* ``a.load(fn, prefault=False)`` 从磁盘加载（通过 mmap）索引。如果将 `prefault` 设置为 `True`，它会预先将整个文件读入内存（使用带有 `MAP_POPULATE` 标志的 mmap）。默认值为 `False`。\n* ``a.unload()`` 卸载索引。\n* ``a.get_nns_by_item(i, n, search_k=-1, include_distances=False)`` 返回最接近的 ``n`` 个条目。查询时会检查最多 ``search_k`` 个节点，默认值为 ``n_trees * n``。``search_k`` 
提供了运行时的权衡：更高的精度和更快的速度之间的取舍。如果将 `include_distances` 设置为 `True`，它将返回一个包含两个列表的元组：第二个列表包含所有对应的距离。\n* ``a.get_nns_by_vector(v, n, search_k=-1, include_distances=False)`` 类似，但通过向量 ``v`` 进行查询。\n* ``a.get_item_vector(i)`` 返回之前添加的条目 ``i`` 的向量。\n* ``a.get_distance(i, j)`` 返回条目 ``i`` 和 ``j`` 之间的距离。注意：此方法过去返回的是 *平方* 距离，但从 2016 年 8 月起已更改为返回实际距离。\n* ``a.get_n_items()`` 返回索引中的条目总数。\n* ``a.get_n_trees()`` 返回索引中的树的数量。\n* ``a.on_disk_build(fn)`` 准备 Annoy 在指定文件中而非内存中构建索引（在添加条目前执行，构建完成后无需保存）。\n* ``a.set_seed(seed)`` 使用给定的种子初始化随机数生成器。仅在构建树时使用，即只有在添加条目之前才需要设置。调用 `a.build(n_trees)` 或 `a.load(fn)` 后将不再生效。\n\n注意事项：\n\n* 对输入值没有进行边界检查，请务必小心。\n* Annoy 的角距离实际上是归一化向量之间的欧几里得距离，对于两个向量 u 和 v，其值为 ``sqrt(2(1-cos(u,v)))``\n\n\nC++ API 非常类似：只需 ``#include \"annoylib.h\"`` 即可访问。\n\n权衡\n---------\n\n调整 Annoy 的主要参数只有两个：树的数量 ``n_trees`` 和查询时检查的节点数量 ``search_k``。\n\n* ``n_trees`` 在构建时指定，会影响构建时间和索引大小。较大的值会提高搜索精度，但也会导致索引变大。\n* ``search_k`` 在运行时指定，影响搜索性能。较大的值会提高搜索精度，但返回结果所需的时间也会更长。\n\n如果未指定 ``search_k``，它将默认为 ``n * n_trees``，其中 ``n`` 是近似最近邻的数量。否则，``search_k`` 和 ``n_trees`` 大致是独立的，即在 ``search_k`` 保持不变的情况下，``n_trees`` 的值不会影响搜索时间，反之亦然。因此，建议在内存允许的情况下尽可能增大 ``n_trees``；而在查询时间有限的情况下，则应尽可能增大 ``search_k``。\n\n你也可以选择接受较慢的搜索速度，以换取更快的加载速度、更低的内存占用和更少的磁盘 I\u002FO。在支持的平台上，索引会在 ``load`` 和 ``save`` 时进行预取，从而提前将文件从磁盘读入内存。如果你将 ``prefault`` 设置为 ``False``，则会按需从磁盘读取内存映射索引的页面，并将其缓存在内存中，直到完成一次搜索为止。这可能会显著增加首次搜索的时间，但对于内存容量远小于索引大小的系统、对已加载索引执行少量查询的情况，或者索引中大部分区域不太可能与搜索查询相关的情形，这种方式可能更为合适。\n\n\n它是如何工作的\n----------------\n\n通过使用随机投影（`random projections \u003Chttp:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLocality-sensitive_hashing#Random_projection>`__）并构建一棵树来实现。在树的每个中间节点上，都会选择一个随机超平面，将空间划分为两个子空间。这个超平面是通过从子集中随机采样两点，并取这两点之间的等距超平面来确定的。\n\n我们重复这一过程 k 次，从而得到一棵森林。k 的具体值需要根据你的需求进行调整，权衡精度和性能之间的折衷。\n\n汉明距离（由 `Martin Aumüller \u003Chttps:\u002F\u002Fgithub.com\u002Fmaumueller>`__ 贡献）会在底层将数据打包成 64 位整数，并利用内置的位计数原语，因此速度相当快。所有的分割都是轴对齐的。\n\n点积距离（由 `Peter Sobot \u003Chttps:\u002F\u002Fgithub.com\u002Fpsobot>`__ 和 `Pavel Korobov 
\u003Chttps:\u002F\u002Fgithub.com\u002Fpkorobov>`__ 贡献）使用微软研究院 Bachrach 等人在 2014 年发表的方法，将提供的向量从点积（或“内积”）空间转换为更适合查询的余弦空间：`a method by Bachrach et al., at Microsoft Research, published in 2014 \u003Chttps:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fwp-content\u002Fuploads\u002F2016\u002F02\u002FXboxInnerProduct.pdf>`__。\n\n\n\n更多信息\n---------\n\n* `Dirk Eddelbuettel \u003Chttps:\u002F\u002Fgithub.com\u002Feddelbuettel>`__ 提供了 `Annoy 的 R 版本 \u003Chttp:\u002F\u002Fdirk.eddelbuettel.com\u002Fcode\u002Frcpp.annoy.html>`__。\n* `Andy Sloane \u003Chttps:\u002F\u002Fgithub.com\u002Fa1k0n>`__ 提供了 `Annoy 的 Java 版本 \u003Chttps:\u002F\u002Fgithub.com\u002Fspotify\u002Fannoy-java>`__，不过目前仅支持余弦距离且为只读模式。\n* `Pishen Tsai \u003Chttps:\u002F\u002Fgithub.com\u002Fpishen>`__ 提供了 `Annoy 的 Scala 封装 \u003Chttps:\u002F\u002Fgithub.com\u002Fpishen\u002Fannoy4s>`__，该封装使用 JNA 调用 Annoy 的 C++ 库。\n* `Atsushi Tatsuma \u003Chttps:\u002F\u002Fgithub.com\u002Fyoshoku>`__ 提供了 `Annoy 的 Ruby 绑定 \u003Chttps:\u002F\u002Fgithub.com\u002Fyoshoku\u002Fannoy.rb>`__。\n* `Taneli Leppä \u003Chttps:\u002F\u002Fgithub.com\u002Frosmo>`__ 提供了 `Go 语言的实验性支持 \u003Chttps:\u002F\u002Fgithub.com\u002Fspotify\u002Fannoy\u002Fblob\u002Fmaster\u002FREADME_GO.rst>`__。\n* `Boris Nagaev \u003Chttps:\u002F\u002Fgithub.com\u002Fstarius>`__ 编写了 `Lua 绑定 \u003Chttps:\u002F\u002Fgithub.com\u002Fspotify\u002Fannoy\u002Fblob\u002Fmaster\u002FREADME_Lua.md>`__。\n* 在 Spotify Hack Week 2016 的部分时间里（以及之后的一段时间），`Jim Kang \u003Chttps:\u002F\u002Fgithub.com\u002Fjimkang>`__ 为 Annoy 编写了 `Node.js 绑定 \u003Chttps:\u002F\u002Fgithub.com\u002Fjimkang\u002Fannoy-node>`__。\n* `Min-Seok Kim \u003Chttps:\u002F\u002Fgithub.com\u002Fmskimm>`__ 构建了 Annoy 的 `Scala 版本 \u003Chttps:\u002F\u002Fgithub.com\u002Fmskimm\u002Fann4s>`__。\n* `hanabi1224 \u003Chttps:\u002F\u002Fgithub.com\u002Fhanabi1224>`__ 构建了 Annoy 的只读 `Rust 版本 \u003Chttps:\u002F\u002Fgithub.com\u002Fhanabi1224\u002FRuAnnoy>`__，同时还提供了 **dotnet、jvm 和 dart** 的只读绑定。\n* 关于 Annoy 的 
`纽约机器学习聚会演讲 \u003Chttp:\u002F\u002Fwww.slideshare.net\u002Ferikbern\u002Fapproximate-nearest-neighbor-methods-and-vector-models-nyc-ml-meetup>`__。\n* Annoy 可以作为 `conda 包 \u003Chttps:\u002F\u002Fanaconda.org\u002Fconda-forge\u002Fpython-annoy>`__ 在 Linux、OS X 和 Windows 上使用。\n* `ann-benchmarks \u003Chttps:\u002F\u002Fgithub.com\u002Ferikbern\u002Fann-benchmarks>`__ 是一个针对多种近似最近邻库的基准测试工具。Annoy 在其中表现相当有竞争力，尤其是在高精度场景下：\n\n.. figure:: https:\u002F\u002Fraw.githubusercontent.com\u002Ferikbern\u002Fann-benchmarks\u002Fmain\u002Fresults\u002Fglove-100-angular.png\n   :alt: ANN 基准测试结果\n   :align: center\n   :target: https:\u002F\u002Fgithub.com\u002Ferikbern\u002Fann-benchmarks\n\n源代码\n-----------\n\n整个项目完全用 C++ 编写，并包含一些为了提升性能和减少内存占用而做的优化措施。请务必注意这一点 :)\n\n感谢 `Qiang Kou \u003Chttps:\u002F\u002Fgithub.com\u002Fthirdwing>`__ 和 `Timothy Riley \u003Chttps:\u002F\u002Fgithub.com\u002Ftjrileywisc>`__ 的努力，代码现在也支持 Windows 系统。\n\n要运行测试，请执行 `python setup.py nosetests`。测试套件中包含一个从互联网下载的大型真实世界数据集，因此执行起来可能需要几分钟时间。\n\n讨论\n-------\n\n欢迎随时在 `annoy-user \u003Chttps:\u002F\u002Fgroups.google.com\u002Fgroup\u002Fannoy-user>`__ 论坛上提出问题或发表评论。我在 Twitter 上的账号是 `@fulhack \u003Chttps:\u002F\u002Ftwitter.com\u002Ffulhack>`__。","# Annoy 快速上手指南\n\nAnnoy (Approximate Nearest Neighbors Oh Yeah) 是由 Spotify 开源的高性能近似最近邻搜索库。它支持构建基于磁盘的只读索引文件，并通过内存映射（mmap）实现多进程共享，特别适合大规模向量检索场景。\n\n## 环境准备\n\n*   **操作系统**：Linux, macOS, Windows\n*   **Python 版本**：支持 Python 2.7, 3.6, 3.7 及更高版本\n*   **依赖**：\n    *   Python 用户：仅需 `pip` 包管理工具。\n    *   C++ 用户：需要标准 C++ 编译器。\n*   **硬件建议**：适用于维度小于 100 的场景（最高支持至 1000 维），对内存占用有严格限制的场景尤为适合。\n\n## 安装步骤\n\n### Python 安装\n推荐使用国内镜像源加速安装：\n\n```bash\npip install --user annoy -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n或者使用官方源：\n\n```bash\npip install --user annoy\n```\n\n### C++ 集成\n无需安装命令，直接克隆仓库并在代码中引入头文件：\n\n```cpp\n#include \"annoylib.h\"\n```\n\n## 基本使用\n\n以下是最简单的 Python 使用示例，涵盖索引构建、保存、加载及查询全过程。\n\n### 1. 
Build and save the index\n\n```python\nfrom annoy import AnnoyIndex\nimport random\n\nf = 40  # vector dimensionality\n\n# Create the index, specifying the dimension and the distance metric ('angular', 'euclidean', 'manhattan', 'hamming', 'dot')\nt = AnnoyIndex(f, 'angular')\n\n# Add items (id, vector)\nfor i in range(1000):\n    v = [random.gauss(0, 1) for _ in range(f)]\n    t.add_item(i, v)\n\n# Build the forest of trees (more trees give higher accuracy at the cost of build time and index size)\nt.build(10)\n\n# Save to disk\nt.save('test.ann')\n```\n\n### 2. Load the index and query\n\n```python\nfrom annoy import AnnoyIndex\n\nf = 40\nu = AnnoyIndex(f, 'angular')\n\n# Load the index (very fast: the file is mmap-ed, so it can be shared across processes)\nu.load('test.ann')\n\n# Find the 1000 nearest neighbors of item 0\nneighbors = u.get_nns_by_item(0, 1000)\nprint(neighbors)\n\n# The optional search_k argument trades query speed for accuracy\n# neighbors = u.get_nns_by_item(0, 1000, search_k=100000)\n\n# You can also query by vector directly\n# query_vector = [...] \n# neighbors = u.get_nns_by_vector(query_vector, 1000)\n```\n\n### Key parameters\n*   **n_trees (build time)**: the number of trees to build. Larger values give higher accuracy but a bigger index file and a longer build; set it as high as memory allows.\n*   **search_k (query time)**: the number of nodes inspected during a search. Defaults to `n_trees * n`, where n is the number of neighbors requested. Larger values give higher accuracy at the cost of query latency, and it can be tuned per query to meet real-time requirements.","The recommendation team at a large e-commerce platform needs a real-time \"recommended for you\" feature over a catalog of tens of millions of items, using item-vector similarity to recall related products quickly.\n\n### Without annoy\n- **Memory blow-up**: loading every item vector into memory takes tens of GB, more than a single server can hold, forcing the service to be sharded or moved onto expensive hardware.\n- **Slow startup**: every service restart or scale-out spends minutes reloading and rebuilding the index, nowhere near the seconds required for elastic scaling.\n- **Wasted resources**: in a multi-process deployment, each process keeps its own copy of the index, duplicating huge amounts of memory.\n- **Rigid updates**: swapping in a new index without interrupting the service is difficult, so every model iteration carries high operational risk.\n\n### With annoy\n- **Minimal memory**: annoy builds the index into a compact read-only file, cutting memory use by an order of magnitude; an ordinary instance can easily serve tens of millions of vectors.\n- **Near-instant loading**: the index file is mmap-ed straight from disk, so loading on startup or scale-out completes almost immediately.\n- **Shared memory**: multiple worker processes map the same index file, eliminating duplicated memory across processes entirely.\n- **Smooth switchover**: index building and loading are fully decoupled; generate a new file in the background and atomically swap it in for a zero-downtime model update.\n\nannoy 
uses memory-mapped index files on disk to deliver millisecond-level approximate nearest-neighbor search over massive high-dimensional data at very low memory cost, letting large-scale recommendation systems run efficiently on inexpensive hardware.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fspotify_annoy_3f827a13.png","spotify","Spotify","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fspotify_6ec80015.png","",null,"opensource@spotify.com","https:\u002F\u002Fspotify.github.io\u002F","https:\u002F\u002Fgithub.com\u002Fspotify",[85,89,93,97,101,105,108],{"name":86,"color":87,"percentage":88},"C++","#f34b7d",49,{"name":90,"color":91,"percentage":92},"Python","#3572A5",30.5,{"name":94,"color":95,"percentage":96},"Lua","#000080",10.3,{"name":98,"color":99,"percentage":100},"Go","#00ADD8",4.7,{"name":102,"color":103,"percentage":104},"C","#555555",3.3,{"name":106,"color":80,"percentage":107},"SWIG",1.7,{"name":109,"color":110,"percentage":111},"CMake","#DA3434",0.5,14221,1220,"2026-04-18T04:01:34","Apache-2.0","Linux, macOS, Windows","Not specified","Not specified (noted for its small memory footprint; indexes can be built on disk to handle datasets that do not fit in memory)",{"notes":120,"python":121,"dependencies":122},"A C++ library with Python bindings. Its core feature is file-based read-only indexes that are mmap-ed into memory, allowing multiple processes to share the same index data. Install via pip or conda. Once an index has been built, no new items can be added. Several distance metrics are supported (Euclidean, Manhattan, cosine, Hamming, dot product). Recommended when memory is limited or data must be shared across processes.","2.7, 3.6, 3.7",[],[18],[125,126,127,128,129,130,131],"c-plus-plus","python","nearest-neighbor-search","locality-sensitive-hashing","approximate-nearest-neighbor-search","golang","lua","2026-03-27T02:49:30.150509","2026-04-19T03:06:37.808402",[135,140,145,150,155,160],{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},41246,"How do I install Annoy on Windows?","Installation on Windows may fail with a missing 'unistd.h' header or other compiler errors. The fix is to point distutils at the MSVC compiler: open \"C:\\Program Files\\Anaconda3\\Lib\\distutils\\distutils.cfg\" (the path varies with the install location) and set compiler to msvc. Also make sure the Microsoft Visual C++ build tools for Python are installed.","https:\u002F\u002Fgithub.com\u002Fspotify\u002Fannoy\u002Fissues\u002F130",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},41247,"Why is the first query for a vector or item so slow?","The first query is usually slow because the memory-mapped (mmap) file has not yet been paged into RAM, so it incurs disk I\u002FO latency. The fix is to warm the page cache: run `cat index.ann > 
\u002Fdev\u002Fnull` to force the file into memory, or iterate over the whole index in code and call `get_vector` to warm it up. In restricted environments such as Google App Engine, increasing instance RAM or using a tmpfs mount point also resolves the problem.","https:\u002F\u002Fgithub.com\u002Fspotify\u002Fannoy\u002Fissues\u002F376",{"id":146,"question_zh":147,"answer_zh":148,"source_url":149},41248,"What should I do about the error 'OSError: Index size is not a multiple of vector size'?","This error means the index file is corrupted or incomplete, so its size is no longer an integer multiple of the vector size. It typically happens when the index write was interrupted or a file transfer did not complete. Delete the current index file and rebuild the index, and make sure the program exits normally, without crashing, while the index is being saved.","https:\u002F\u002Fgithub.com\u002Fspotify\u002Fannoy\u002Fissues\u002F423",{"id":151,"question_zh":152,"answer_zh":153,"source_url":154},41249,"How do I fix 'fatal error C1083: Cannot open include file: unistd.h' when installing on Windows?","`unistd.h` is a Linux\u002FUnix-specific header that does not exist on Windows by default. Annoy is supposed to handle this platform difference, but if installing the source package via pip fails, check that the correct compiler environment is being used. If the problem persists, try a prebuilt binary package (if one is available), or configure the environment to use MinGW with the appropriate Windows compatibility layer. The most recommended approach is to follow the official Windows installation guide, which usually means installing the Visual Studio Build Tools and letting setup.py handle the conditional compilation.","https:\u002F\u002Fgithub.com\u002Fspotify\u002Fannoy\u002Fissues\u002F97",{"id":156,"question_zh":157,"answer_zh":158,"source_url":159},41250,"Why does the test suite fail with assertion errors (e.g. test_get_nns_by_item)?","Test failures are usually edge cases triggered by randomness, or out-of-bounds array accesses in the underlying C++ code (for example declaring `T v[N]` but indexing past N). The maintainers fix memory-safety issues of this kind regularly, so make sure you are running the latest code, which includes fixes for race conditions and memory errors in the tests. Contributors should check for out-of-bounds array indexing.","https:\u002F\u002Fgithub.com\u002Fspotify\u002Fannoy\u002Fissues\u002F13",{"id":161,"question_zh":162,"answer_zh":163,"source_url":144},41251,"Is the prefault option supported on all platforms?","No, the `prefault` option is not supported on every platform. Its purpose is to page the index into physical memory at load time, reducing first-access latency. On some operating systems (certain Windows versions, or particular cloud environments) the flag may be ignored or have no effect. If that causes performance problems, read the file manually (e.g. `cat file > \u002Fdev\u002Fnull`) for a similar warming effect.",[165,170,175,180,185,190,195,200,204,209,213,218,223,228,233,238,243,248,253,258],{"id":166,"version":167,"summary_zh":168,"released_at":169},333207,"v1.17.3","Essentially just #645, which fixes a compilation problem on OS X.","2023-06-14T16:34:48",{"id":171,"version":172,"summary_zh":173,"released_at":174},333208,"v1.17.2","Fixes the memory leak reported in #633.
","2023-04-10T15:01:29",{"id":176,"version":177,"summary_zh":178,"released_at":179},333209,"v1.17.1","Adds `-fpermissive` plus a few other minor changes.","2022-08-08T09:33:59",{"id":181,"version":182,"summary_zh":183,"released_at":184},333210,"v1.17.0","* Multithreaded index building, implemented by @novoselrok\n* README updates, including a link to the Ruby bindings and documentation cleanup\n* Improved tree splitting: multiple attempts are made to find a more balanced split\n* Fixes for on-disk builds and large-file saving on Windows\n* Cleanup of miscellaneous compiler warnings","2020-09-18T16:07:34",{"id":186,"version":187,"summary_zh":188,"released_at":189},333211,"v1.16.3","* Mainly fixes Windows-related problems when handling large files\n* Improves some error messages\n* Uses stack allocation rather than heap allocation in more places\n* Fixes several test cases that failed on other platforms","2019-12-26T21:14:36",{"id":191,"version":192,"summary_zh":193,"released_at":194},333212,"v1.16.2","Something seems off with the 1.16.1 release; I think I may have accidentally uploaded it a long time ago.\n\nUpgrade straight to 1.16.2.\n\nThe contents are identical to 1.16.1.","2019-10-11T17:15:11",{"id":196,"version":197,"summary_zh":198,"released_at":199},333213,"v1.16.1","Mostly minor changes to catch user errors and\u002For report problems more clearly. There are no real functional changes, apart from guarding against a few dubious usage patterns that could lead to complications.\n\n#430 – AVX problem on some GCC versions  \n#427 – Prevent saving an index that has not been built  \n#425 – Prevent rebuilding an index that already exists  \n#426 – Handle failing lseek calls  \n#419 – ftruncate issue  \n#410 – Update SWIG  \n#405 – Use std::sort  \n\nThe remaining changes are documentation and tests only, with no functional impact.","2019-10-11T17:09:48",{"id":201,"version":202,"summary_zh":80,"released_at":203},333214,"1.16","2019-07-09T02:17:06",{"id":205,"version":206,"summary_zh":207,"released_at":208},333215,"v1.15.2","- Fixes error handling for the save operation in #379\n- Fixes a compilation problem on Linux\n- Fixes popcount #366\n- A few other miscellaneous improvements\n","2019-04-17T01:53:49",{"id":210,"version":211,"summary_zh":80,"released_at":212},333216,"1.15","2019-04-17T01:52:27",{"id":214,"version":215,"summary_zh":216,"released_at":217},333217,"v1.15.1","See #349.","2019-02-22T16:13:00",{"id":219,"version":220,"summary_zh":221,"released_at":222},333218,"v1.15.0","Changes:\n- #274 – On-disk builds\n- Xcode fix: #334\n- Minor fix for saving an index to the same file as the current index: #335","2018-12-10T03:18:48",{"id":224,"version":225,"summary_zh":226,"released_at":227},333219,"v1.14.0","* Fixes to the Euclidean distance function (avoid catastrophic cancellation)\r\n* Don't MAP_POPULATE by 
default","2018-11-04T19:43:46",{"id":229,"version":230,"summary_zh":231,"released_at":232},333220,"v1.13.0","Main (only?) change is that dot products are now supported thanks to @psobot!","2018-08-31T03:15:32",{"id":234,"version":235,"summary_zh":236,"released_at":237},333221,"v1.12.0","Finally squashed a remaining issue with holes (see #295) and together with a few other holes-related fixes it felt worthy of bumping the minor version number.","2018-05-07T03:11:56",{"id":239,"version":240,"summary_zh":241,"released_at":242},333222,"v1.11.5","Fixes #279 and #261 ","2018-03-31T19:51:55",{"id":244,"version":245,"summary_zh":246,"released_at":247},333223,"v1.11.4","Includes #276 ","2018-02-06T01:39:13",{"id":249,"version":250,"summary_zh":251,"released_at":252},333224,"v1.11.1","This version features AVX instructions thanks to @ReneHollander ","2018-01-20T04:53:02",{"id":254,"version":255,"summary_zh":256,"released_at":257},333225,"v1.10.0","Two big updates\r\n\r\n* Windows support now official – with CI pipeline to prove it. Only tested on Python 3.6 so far\r\n* Experimental Hamming distances","2017-11-13T14:40:35",{"id":259,"version":260,"summary_zh":261,"released_at":262},333226,"v1.9.5","* Fixes a problem with holes in the index reported by @yonromai in #223 ","2017-09-30T15:49:48"]