[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-openucx--ucx":3,"tool-openucx--ucx":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":80,"owner_twitter":79,"owner_website":81,"owner_url":82,"languages":83,"stars":122,"forks":123,"last_commit_at":124,"license":125,"difficulty_score":126,"env_os":127,"env_gpu":128,"env_ram":129,"env_deps":130,"category_tags":138,"github_topics":139,"view_count":23,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":160,"updated_at":161,"faqs":162,"releases":190},2554,"openucx\u002Fucx","ucx","Unified Communication X  (mailing list - https:\u002F\u002Felist.ornl.gov\u002Fmailman\u002Flistinfo\u002Fucx-group)","UCX（Unified Communication X）是一款专为现代高性能网络设计的通信框架，旨在解决高带宽、低延迟环境下的数据传输瓶颈问题。它通过提供一套抽象的通信原语，智能调度并充分利用底层硬件资源，支持包括 RDMA（InfiniBand 和 RoCE）、TCP、GPU 直接通信、共享内存及网络原子操作等多种传输方式。\n\n这款工具主要服务于需要极致通信性能的开发者、系统架构师及科研人员，特别是那些从事大规模并行计算、深度学习训练或构建高性能集群的用户。UCX 的核心亮点在于其“硬件感知”能力：它能自动识别当前环境可用的最佳硬件加速路径，将数据搬运任务卸载到专用硬件上，从而显著降低 CPU 负载并提升整体吞吐量。作为多个获奖项目的基础组件，UCX 已被广泛集成于 OpenMPI、OpenSHMEM 等主流并行编程接口中，是打造高效能计算系统的坚实基石。无论是进行底层驱动开发还是优化分布式应用，UCX 都能提供生产级验证的稳定支持。","\u003Cdiv align=\"center\">\n  \u003Ca href=\"http:\u002F\u002Fwww.openucx.org\u002F\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopenucx_ucx_readme_dac04607d4b0.png\" 
width=\"200\">\u003C\u002Fa>\n  \u003Cbr>\n  \u003Ca href=\"https:\u002F\u002Ftwitter.com\u002Fintent\u002Ffollow?screen_name=openucx\"> \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Ffollow\u002Fopenucx?style=social&logo=twitter\" alt=\"follow on Twitter\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fopenucx.github.io\u002Fucx\u002Fapi\u002Flatest\u002Fhtml\u002F\">\u003Cimg src=\"docs\u002Fdoxygen\u002Fapi.svg\">\u003C\u002Fa>\n  \u003Ca href='https:\u002F\u002Fopenucx.readthedocs.io\u002Fen\u002Fmaster\u002F?badge=master'>\u003Cimg src='https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopenucx_ucx_readme_13d664e1afd7.png' alt='Documentation Status' \u002F>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fopenucx\u002Fucx\u002Freleases\u002Flatest\">\u003Cimg src=\"docs\u002Fdoxygen\u002Frelease.svg\">\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\u003Chr>\n\n# Unified Communication X\n\nUnified Communication X (UCX) is an\n[award winning](https:\u002F\u002Flosalamosreporter.com\u002F2019\u002F11\u002F07\u002Fnine-los-alamos-national-laboratory-projects-win-rd-100-awards),\noptimized production proven-communication framework for modern, high-bandwidth\nand low-latency networks.\n\nUCX exposes a set of abstract communication primitives that utilize the best of\navailable hardware resources and offloads. 
These include RDMA (InfiniBand and RoCE),\nTCP, GPUs, shared memory, and network atomic operations.\n\nPlease visit our [documentation site](https:\u002F\u002Fopenucx.readthedocs.io\u002Fen\u002Fmaster)\n for more details.\n\nPlease review our [\"Membership Voluntary Consensus Standard\"](https:\u002F\u002Fucfconsortium.org\u002Fpolicy\u002F)  and\n[\"Export Compliant Contribution Submissions\"](https:\u002F\u002Fucfconsortium.org\u002Fpolicy\u002F) policies.\n\n\u003Chr>\n\n\u003C!-- TOC generated by https:\u002F\u002Fgithub.com\u002Fekalinin\u002Fgithub-markdown-toc -->\n\n* [Using UCX](#using-ucx)\n* [Known issues](#known-issues)\n* [Architecture](#architecture)\n* [Supported Transports](#supported-transports)\n* [Supported CPU Architectures](#supported-cpu-architectures)\n* [Licenses](#licenses)\n* [Our Community](#our-community)\n* [Contributor Agreement and Guidelines](#contributor-agreement-and-guidelines)\n* [Publications](#publications)\n\n\n\u003Chr>\n\n## Using UCX\n\n### Release Builds\n\nBuilding UCX is typically a combination of running \"configure\" and \"make\".\nIf using a release tarball execute the following commands to install the\nUCX system from within the directory at the top of the tree:\n\n```sh\n$ .\u002Fcontrib\u002Fconfigure-release --prefix=\u002Fwhere\u002Fto\u002Finstall\n$ make -j8\n$ make install\n```\n\nIf directly cloning the git repository use:\n\n```sh\n$ .\u002Fautogen.sh\n$ .\u002Fcontrib\u002Fconfigure-release --prefix=\u002Fwhere\u002Fto\u002Finstall\n$ make -j8\n$ make install\n```\n\nNOTE: Compiling support for various networks or other specific hardware may\nrequire additional command line flags when running configure.\n\n### Developer Builds\n\n```bash\n$ .\u002Fautogen.sh\n$ .\u002Fcontrib\u002Fconfigure-devel --prefix=$PWD\u002Finstall-debug\n```\n\n*** NOTE: Developer builds of UCX typically include a large performance\npenalty at run-time because of extra debugging code.\n\n### Build RPM package\n```bash\n$ 
contrib\u002Fbuildrpm.sh -s -b\n```\n\n### Build DEB package\n```bash\n$ dpkg-buildpackage -us -uc\n```\n\n### Build Doxygen documentation\n```bash\n$ make docs\n```\n\n### OpenMPI and OpenSHMEM installation with UCX\n[Wiki page](http:\u002F\u002Fgithub.com\u002Fopenucx\u002Fucx\u002Fwiki\u002FOpenMPI-and-OpenSHMEM-installation-with-UCX)\n\n### MPICH installation with UCX\n[Wiki page](http:\u002F\u002Fgithub.com\u002Fopenucx\u002Fucx\u002Fwiki\u002FMPICH-installation-with-UCX)\n\n### UCX Performance Test\n\nStart server:\n\n```sh\n$ .\u002Fsrc\u002Ftools\u002Fperf\u002Fucx_perftest -c 0\n```\nConnect client:\n\n```sh\n$ .\u002Fsrc\u002Ftools\u002Fperf\u002Fucx_perftest \u003Cserver-hostname> -t tag_lat -c 1\n```\n> NOTE the `-c` flag sets CPU affinity. If running both commands on same host, make sure you set the affinity to different CPU cores.\n\n\n### Running internal unit tests\n\n```sh\n$ make -C test\u002Fgtest test\n```\n\n\u003Chr>\n\n\n## Known issues\n* UCX version 1.8.0 has a bug that may cause data corruption when TCP transport\n  is used in conjunction with shared memory transport. It is advised to upgrade\n  to UCX version 1.9.0 and above. UCX versions released before 1.8.0 don't have\n  this bug.\n\n* UCX may hang with glibc versions 2.25-2.29 due to known bugs in the\n  pthread_rwlock functions. When such hangs occur, one of the UCX threads gets\n  stuck in pthread_rwlock_rdlock (which is called by ucs_rcache_get), even\n  though no other thread holds the lock. 
A related issue is reported in\n  [glibc Bug 23844](https:\u002F\u002Fsourceware.org\u002Fbugzilla\u002Fshow_bug.cgi?id=23844).\n  If this issue occurs, it is advised to use glibc version provided with your\n  OS distribution or build glibc from source using versions less than 2.25 or\n  greater than 2.29.\n\n* Due to compatibility flaw when using UCX with rdma-core v22 setting\n  UCX_DC_MLX5_RX_INLINE=0 is unsupported and will make DC transport unavailable.\n  This issue is fixed in rdma-core v24 and backported to rdma-core-22.4-2.el7 rpm.\n  See [ucx issue 5749](https:\u002F\u002Fgithub.com\u002Fopenucx\u002Fucx\u002Fissues\u002F5749) for more\n  details.\n\n* If network routing is incorrectly recognized, leading to peers reported as\n  \"unreachable\", change the default reachability check mode by setting:\n  UCX_IB_ROCE_REACHABILITY_MODE=all.\n\n\u003Chr>\n\n\n## Architecture\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopenucx_ucx_readme_34e13039266b.png)\n\n| Component | Role        | Description |\n| :---:     | :---:       | ---         |\n| UCP | Protocol          | Implements high-level abstractions such as tag-matching, streams, connection negotiation and establishment, multi-rail, and handling different memory types |\n| UCT | Transport         | Implements low-level communication primitives such as active messages, remote memory access, and atomic operations |\n| UCS | Services          | A collection of data structures, algorithms, and system utilities for common use |\n| UCM | Memory            | Intercepts memory allocation and release events, used by the  memory registration cache |\n\n\u003Chr>\n\n## Supported Transports\n\n* [Infiniband](https:\u002F\u002Fwww.infinibandta.org\u002F)\n* [Omni-Path](https:\u002F\u002Fwww.intel.com\u002Fcontent\u002Fwww\u002Fus\u002Fen\u002Fhigh-performance-computing-fabrics\u002Fomni-path-driving-exascale-computing.html)\n* [RoCE](http:\u002F\u002Fwww.roceinitiative.org\u002F)\n* [Cray Gemini and 
Aries](https:\u002F\u002Fwww.cray.com\u002F)\n* [CUDA](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-zone)\n* [ROCm](https:\u002F\u002Frocm.docs.amd.com\u002F)\n* Shared Memory\n    * posix, sysv, [cma](https:\u002F\u002Fdl.acm.org\u002Fcitation.cfm?id=2616532), [knem](http:\u002F\u002Fknem.gforge.inria.fr\u002F), and [xpmem](https:\u002F\u002Fgithub.com\u002Fopenucx\u002Fxpmem)\n* TCP\u002FIP\n\n**NOTE:** UCX >= 1.12.0 requires rdma-core >= 28.0 or MLNX_OFED >= 5.0 for [Infiniband](https:\u002F\u002Fwww.infinibandta.org\u002F) and [RoCE](http:\u002F\u002Fwww.roceinitiative.org\u002F) transports support.\n\u003Chr>\n\n## Supported CPU Architectures\n\n* [x86_64](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FX86-64)\n* [Power8\u002F9](https:\u002F\u002Fwww.ibm.com\u002Fsupport\u002Fknowledgecenter\u002Fen\u002FPOWER9\u002Fp9hdx\u002FPOWER9welcome.htm)\n* [Arm v8](https:\u002F\u002Fwww.arm.com\u002Fproducts\u002Fsilicon-ip-cpu)\n\n\u003Chr>\n\n## Licenses\n\nUCX is licensed as:\n\n* [BSD3](LICENSE)\n\n\u003Chr>\n\n## Our Community\n\n* [Project Website](http:\u002F\u002Fwww.openucx.org\u002F)\n* [ReadTheDocs](https:\u002F\u002Fopenucx.readthedocs.io\u002Fen\u002Fmaster\u002F)\n* [Github](http:\u002F\u002Fwww.github.com\u002Fopenucx\u002Fucx\u002F)\n* [Software Releases](http:\u002F\u002Fwww.github.com\u002Fopenucx\u002Fucx\u002Freleases)\n* [Mailing List](https:\u002F\u002Felist.ornl.gov\u002Fmailman\u002Flistinfo\u002Fucx-group)\n* [Twitter](https:\u002F\u002Ftwitter.com\u002Fopenucx)\n\n\u003Chr>\n\n## Contributor Agreement and Guidelines\n\nIn order to contribute to UCX, please sign up with an appropriate\n[Contributor Agreement](http:\u002F\u002Fwww.openucx.org\u002Flicense\u002F).\n\nAll contributors have to comply with [\"Membership Voluntary\nConsensus Standard\"](https:\u002F\u002Fucfconsortium.org\u002Fpolicy\u002F)  and [\"Export Compliant\nContribution Submissions\"](https:\u002F\u002Fucfconsortium.org\u002Fpolicy\u002F) policies.\n\nFollow 
these\n[instructions](https:\u002F\u002Fgithub.com\u002Fopenucx\u002Fucx\u002Fwiki\u002FGuidance-for-contributors)\nwhen submitting contributions and changes.\n\n## Publications\n\nTo reference UCX in a publication, please use the following entry:\n\n```bibtex\n@inproceedings{shamis2015ucx,\n  title={UCX: an open source framework for HPC network APIs and beyond},\n  author={Shamis, Pavel and Venkata, Manjunath Gorentla and Lopez, M Graham and Baker, Matthew B and Hernandez, Oscar and Itigin, Yossi and Dubman, Mike and Shainer, Gilad and Graham, Richard L and Liss, Liran and others},\n  booktitle={2015 IEEE 23rd Annual Symposium on High-Performance Interconnects},\n  pages={40--43},\n  year={2015},\n  organization={IEEE}\n}\n```\n\nTo reference the UCX website:\n\n```bibtex\n@misc{openucx-website,\n    title = {{The Unified Communication X Library}},\n    key = {{{The Unified Communication X Library}}},\n    howpublished = {{\\url{http:\u002F\u002Fwww.openucx.org}}}\n}\n```\n","\u003Cdiv align=\"center\">\n  \u003Ca href=\"http:\u002F\u002Fwww.openucx.org\u002F\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopenucx_ucx_readme_dac04607d4b0.png\" width=\"200\">\u003C\u002Fa>\n  \u003Cbr>\n  \u003Ca href=\"https:\u002F\u002Ftwitter.com\u002Fintent\u002Ffollow?screen_name=openucx\"> \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Ffollow\u002Fopenucx?style=social&logo=twitter\" alt=\"关注Twitter\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fopenucx.github.io\u002Fucx\u002Fapi\u002Flatest\u002Fhtml\u002F\">\u003Cimg src=\"docs\u002Fdoxygen\u002Fapi.svg\">\u003C\u002Fa>\n  \u003Ca href='https:\u002F\u002Fopenucx.readthedocs.io\u002Fen\u002Fmaster\u002F?badge=master'>\u003Cimg src='https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopenucx_ucx_readme_13d664e1afd7.png' alt='文档状态' \u002F>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fopenucx\u002Fucx\u002Freleases\u002Flatest\">\u003Cimg 
src=\"docs\u002Fdoxygen\u002Frelease.svg\">\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\u003Chr>\n\n# 统一通信X\n\n统一通信X（UCX）是一个屡获殊荣的、经过优化且在生产环境中得到验证的通信框架，专为现代高带宽、低延迟网络而设计。它曾荣获[RD-100奖](https:\u002F\u002Flosalamosreporter.com\u002F2019\u002F11\u002F07\u002Fnine-los-alamos-national-laboratory-projects-win-rd-100-awards)。\n\nUCX提供了一组抽象的通信原语，能够充分利用现有的硬件资源和卸载技术，包括RDMA（InfiniBand和RoCE）、TCP、GPU、共享内存以及网络原子操作等。\n\n更多详细信息，请访问我们的[文档网站](https:\u002F\u002Fopenucx.readthedocs.io\u002Fen\u002Fmaster)。\n\n请查阅我们的[\"成员自愿共识标准\"](https:\u002F\u002Fucfconsortium.org\u002Fpolicy\u002F)和[\"符合出口管制的贡献提交\"](https:\u002F\u002Fucfconsortium.org\u002Fpolicy\u002F)政策。\n\n\u003Chr>\n\n\u003C!-- TOC generated by https:\u002F\u002Fgithub.com\u002Fekalinin\u002Fgithub-markdown-toc -->\n\n* [使用UCX](#using-ucx)\n* [已知问题](#known-issues)\n* [架构](#architecture)\n* [支持的传输协议](#supported-transports)\n* [支持的CPU架构](#supported-cpu-architectures)\n* [许可证](#licenses)\n* [我们的社区](#our-community)\n* [贡献者协议与指南](#contributor-agreement-and-guidelines)\n* [出版物](#publications)\n\n\n\u003Chr>\n\n## 使用UCX\n\n### 发布版构建\n\n通常，构建UCX需要先运行“configure”脚本，再执行“make”。如果使用的是发布版tarball，可在树形结构的根目录下执行以下命令来安装UCX系统：\n\n```sh\n$ .\u002Fcontrib\u002Fconfigure-release --prefix=\u002Fwhere\u002Fto\u002Finstall\n$ make -j8\n$ make install\n```\n\n如果是直接从Git仓库克隆，则应使用：\n\n```sh\n$ .\u002Fautogen.sh\n$ .\u002Fcontrib\u002Fconfigure-release --prefix=\u002Fwhere\u002Fto\u002Finstall\n$ make -j8\n$ make install\n```\n\n注意：编译支持多种网络或其他特定硬件时，可能需要在运行configure时添加额外的命令行参数。\n\n### 开发版构建\n\n```bash\n$ .\u002Fautogen.sh\n$ .\u002Fcontrib\u002Fconfigure-devel --prefix=$PWD\u002Finstall-debug\n```\n\n*** 注意：由于包含额外的调试代码，UCX的开发版在运行时通常会带来较大的性能损失。\n\n### 构建RPM包\n```bash\n$ contrib\u002Fbuildrpm.sh -s -b\n```\n\n### 构建DEB包\n```bash\n$ dpkg-buildpackage -us -uc\n```\n\n### 构建Doxygen文档\n```bash\n$ make docs\n```\n\n### 使用UCX安装OpenMPI和OpenSHMEM\n[维基页面](http:\u002F\u002Fgithub.com\u002Fopenucx\u002Fucx\u002Fwiki\u002FOpenMPI-and-OpenSHMEM-installation-with-UCX)\n\n### 
使用UCX安装MPICH\n[维基页面](http:\u002F\u002Fgithub.com\u002Fopenucx\u002Fucx\u002Fwiki\u002FMPICH-installation-with-UCX)\n\n### UCX性能测试\n\n启动服务器：\n\n```sh\n$ .\u002Fsrc\u002Ftools\u002Fperf\u002Fucx_perftest -c 0\n```\n连接客户端：\n\n```sh\n$ .\u002Fsrc\u002Ftools\u002Fperf\u002Fucx_perftest \u003Cserver-hostname> -t tag_lat -c 1\n```\n> 注意 `-c` 标志用于设置CPU亲和性。如果在同一台主机上同时运行这两个命令，请确保将亲和性分别设置到不同的CPU核心上。\n\n\n### 运行内部单元测试\n\n```sh\n$ make -C test\u002Fgtest test\n```\n\n\u003Chr>\n\n\n## 已知问题\n* UCX 1.8.0版本存在一个可能导致数据损坏的bug，当TCP传输与共享内存传输结合使用时会出现此问题。建议升级到UCX 1.9.0或更高版本。1.8.0之前的版本不存在此问题。\n\n* 由于pthread_rwlock函数中的已知缺陷，UCX在glibc 2.25至2.29版本中可能会挂起。当发生此类挂起时，其中一个UCX线程会卡在pthread_rwlock_rdlock（由ucs_rcache_get调用）中，即使没有其他线程持有锁。相关问题已在[glibc Bug 23844](https:\u002F\u002Fsourceware.org\u002Fbugzilla\u002Fshow_bug.cgi?id=23844)中报告。如果遇到此问题，建议使用操作系统自带的glibc版本，或从源码编译低于2.25或高于2.29的glibc版本。\n\n* 由于兼容性问题，当UCX与rdma-core v22一起使用时，设置UCX_DC_MLX5_RX_INLINE=0是不被支持的，并会导致DC传输不可用。此问题已在rdma-core v24中修复，并后向移植到了rdma-core-22.4-2.el7 rpm中。更多信息请参见[ucx issue 5749](https:\u002F\u002Fgithub.com\u002Fopenucx\u002Fucx\u002Fissues\u002F5749)。\n\n* 如果网络路由识别错误，导致对等节点显示为“不可达”，可以通过设置UCX_IB_ROCE_REACHABILITY_MODE=all来更改默认的可达性检查模式。\n\n\u003Chr>\n\n\n## 架构\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopenucx_ucx_readme_34e13039266b.png)\n\n| 组件 | 职责        | 描述 |\n| :---:     | :---:       | ---         |\n| UCP | 协议层          | 实现高级抽象，如标签匹配、流、连接协商与建立、多路径传输以及不同内存类型的处理 |\n| UCT | 传输层         | 实现低层通信原语，如主动消息、远程内存访问和原子操作 |\n| UCS | 服务层          | 提供常用的数据结构、算法和系统工具 |\n| UCM | 内存管理层            | 拦截内存分配和释放事件，用于内存注册缓存 |\n\n\u003Chr>\n\n## 支持的传输协议\n\n* [Infiniband](https:\u002F\u002Fwww.infinibandta.org\u002F)\n* [Omni-Path](https:\u002F\u002Fwww.intel.com\u002Fcontent\u002Fwww\u002Fus\u002Fen\u002Fhigh-performance-computing-fabrics\u002Fomni-path-driving-exascale-computing.html)\n* [RoCE](http:\u002F\u002Fwww.roceinitiative.org\u002F)\n* [Cray Gemini和Aries](https:\u002F\u002Fwww.cray.com\u002F)\n* 
[CUDA](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-zone)\n* [ROCm](https:\u002F\u002Frocm.docs.amd.com\u002F)\n* 共享内存\n    * posix、sysv、[cma](https:\u002F\u002Fdl.acm.org\u002Fcitation.cfm?id=2616532)、[knem](http:\u002F\u002Fknem.gforge.inria.fr\u002F)和[xpmem](https:\u002F\u002Fgithub.com\u002Fopenucx\u002Fxpmem)\n* TCP\u002FIP\n\n**注：** UCX >= 1.12.0要求rdma-core >= 28.0或MLNX_OFED >= 5.0才能支持[Infiniband](https:\u002F\u002Fwww.infinibandta.org\u002F)和[RoCE](http:\u002F\u002Fwww.roceinitiative.org\u002F)传输。\n\u003Chr>\n\n## 支持的CPU架构\n\n* [x86_64](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FX86-64)\n* [Power8\u002F9](https:\u002F\u002Fwww.ibm.com\u002Fsupport\u002Fknowledgecenter\u002Fen\u002FPOWER9\u002Fp9hdx\u002FPOWER9welcome.htm)\n* [Arm v8](https:\u002F\u002Fwww.arm.com\u002Fproducts\u002Fsilicon-ip-cpu)\n\n\u003Chr>\n\n## 许可证\n\nUCX采用以下许可证：\n\n* [BSD3](LICENSE)\n\n\u003Chr>\n\n## 我们的社区\n\n* [项目官网](http:\u002F\u002Fwww.openucx.org\u002F)\n* [ReadTheDocs](https:\u002F\u002Fopenucx.readthedocs.io\u002Fen\u002Fmaster\u002F)\n* [Github](http:\u002F\u002Fwww.github.com\u002Fopenucx\u002Fucx\u002F)\n* [软件发布](http:\u002F\u002Fwww.github.com\u002Fopenucx\u002Fucx\u002Freleases)\n* [邮件列表](https:\u002F\u002Felist.ornl.gov\u002Fmailman\u002Flistinfo\u002Fucx-group)\n* [Twitter](https:\u002F\u002Ftwitter.com\u002Fopenucx)\n\n\u003Chr>\n\n## 贡献者协议与指南\n\n为了向 UCX 作出贡献，请签署相应的\n[贡献者协议](http:\u002F\u002Fwww.openucx.org\u002Flicense\u002F)。\n\n所有贡献者必须遵守“成员自愿共识标准”和“符合出口管制的贡献提交”政策。\n\n提交贡献和更改时，请遵循这些\n[说明](https:\u002F\u002Fgithub.com\u002Fopenucx\u002Fucx\u002Fwiki\u002FGuidance-for-contributors)。\n\n## 出版物\n\n若在出版物中引用 UCX，请使用以下条目：\n\n```bibtex\n@inproceedings{shamis2015ucx,\n  title={UCX: an open source framework for HPC network APIs and beyond},\n  author={Shamis, Pavel and Venkata, Manjunath Gorentla and Lopez, M Graham and Baker, Matthew B and Hernandez, Oscar and Itigin, Yossi and Dubman, Mike and Shainer, Gilad and Graham, Richard L and Liss, Liran and others},\n  
booktitle={2015 IEEE 23rd Annual Symposium on High-Performance Interconnects},\n  pages={40--43},\n  year={2015},\n  organization={IEEE}\n}\n```\n\n若引用 UCX 官网，请使用以下条目：\n\n```bibtex\n@misc{openucx-website,\n    title = {{统一通信 X 库}},\n    key = {{{统一通信 X 库}}},\n    howpublished = {{\\url{http:\u002F\u002Fwww.openucx.org}}}\n}\n```","# UCX 快速上手指南\n\nUnified Communication X (UCX) 是一个专为现代高带宽、低延迟网络优化的通信框架，广泛应用于高性能计算（HPC）领域。它支持 RDMA (InfiniBand\u002FRoCE)、TCP、GPU 直接通信等多种传输方式。\n\n## 1. 环境准备\n\n### 系统要求\n*   **操作系统**: Linux (推荐 CentOS, RHEL, Ubuntu 等主流发行版)\n*   **CPU 架构**: x86_64, Power8\u002F9, Arm v8\n*   **关键依赖**:\n    *   若使用 InfiniBand 或 RoCE，需安装 `rdma-core` (版本 >= 28.0) 或 `MLNX_OFED` (版本 >= 5.0)。\n    *   基础编译工具：`gcc`, `make`, `autoconf`, `automake`, `libtool`, `binutils-devel`。\n    *   可选依赖：`cuda-devel` (GPU 支持), `knem` (共享内存加速)。\n\n> **提示**：国内用户若使用 Mellanox 网卡，建议通过厂商提供的国内镜像源安装 MLNX_OFED 驱动以获得最佳兼容性。\n\n## 2. 安装步骤\n\n您可以选择从源码编译安装，这是最灵活的方式。\n\n### 方案 A：从 Release 包安装（推荐稳定版）\n如果您已下载 release tarball (`.tar.gz`)：\n\n```bash\n# 解压并进入目录\ntar -xzf ucx-\u003Cversion>.tar.gz\ncd ucx-\u003Cversion>\n\n# 配置、编译并安装\n# 请将 \u002Fwhere\u002Fto\u002Finstall 替换为您期望的安装路径\n.\u002Fcontrib\u002Fconfigure-release --prefix=\u002Fwhere\u002Fto\u002Finstall\nmake -j8\nmake install\n```\n\n### 方案 B：从 Git 仓库克隆（开发版或最新代码）\n```bash\n# 克隆仓库\ngit clone https:\u002F\u002Fgithub.com\u002Fopenucx\u002Fucx.git\ncd ucx\n\n# 生成构建脚本\n.\u002Fautogen.sh\n\n# 配置、编译并安装\n.\u002Fcontrib\u002Fconfigure-release --prefix=\u002Fwhere\u002Fto\u002Finstall\nmake -j8\nmake install\n```\n\n> **注意**：\n> *   编译特定硬件支持（如特定网卡或 GPU）可能需要在 `configure` 阶段添加额外的命令行标志。\n> *   安装完成后，请记得将 `$INSTALL_PATH\u002Flib` 添加到 `LD_LIBRARY_PATH` 环境变量中。\n\n## 3. 
基本使用\n\nUCX 自带性能测试工具 `ucx_perftest`，可用于验证安装是否成功及测试网络性能。以下示例演示如何在单机上启动服务器和客户端进行延迟测试。\n\n### 启动服务端\n在终端 A 中运行（`-c 0` 表示绑定到 CPU 核心 0）：\n\n```bash\n.\u002Fsrc\u002Ftools\u002Fperf\u002Fucx_perftest -c 0\n```\n\n### 启动客户端\n在终端 B 中运行，连接到服务端主机（将 `\u003Cserver-hostname>` 替换为实际 IP 或主机名，`-c 1` 绑定到不同核心以避免资源竞争）：\n\n```bash\n.\u002Fsrc\u002Ftools\u002Fperf\u002Fucx_perftest \u003Cserver-hostname> -t tag_lat -c 1\n```\n\n运行成功后，客户端将输出延迟测试结果。若需与 OpenMPI 或 MPICH 集成，请在配置这些 MPI 库时指定 `--with-ucx=\u003Cinstall_path>`。","某超算中心的研究团队正在运行大规模气候模拟任务，需要在数千个 GPU 节点间频繁交换海量网格数据。\n\n### 没有 ucx 时\n- 通信框架无法自动识别底层硬件，导致高速 InfiniBand 网络被降级为普通 TCP 传输，带宽利用率不足 20%。\n- CPU 在数据拷贝过程中过度参与，大量算力被消耗在内存搬运上，导致核心计算任务排队等待。\n- 不同节点间的显存数据交互必须经过主机内存中转，增加了额外的延迟和上下文切换开销。\n- 面对复杂的混合网络环境，开发人员需手动编写大量适配代码来切换传输协议，维护成本极高。\n\n### 使用 ucx 后\n- ucx 自动探测并优先调用 RDMA（InfiniBand\u002FRoCE）硬件卸载能力，将网络带宽利用率提升至 90% 以上。\n- 通过零拷贝技术和共享内存机制，ucx 让 GPU 显存直接通信，彻底释放了 CPU 资源用于科学计算。\n- 统一的抽象接口屏蔽了底层网络差异，同一套代码即可在 TCP、RoCE 或共享内存环境中高效运行。\n- 内置的原子操作和流控优化显著降低了微秒级延迟，使万核规模下的并行效率大幅提升。\n\nucx 通过智能调度硬件资源，将原本受限于通信瓶颈的超算集群转化为真正的高性能计算引擎。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopenucx_ucx_dac04607.png","openucx","UCX","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fopenucx_29d6fc14.png","Unified Communication 
X",null,"ucx-group@elist.ornl.gov","http:\u002F\u002Fwww.openucx.org","https:\u002F\u002Fgithub.com\u002Fopenucx",[84,88,92,96,99,103,107,111,115,118],{"name":85,"color":86,"percentage":87},"C","#555555",61.9,{"name":89,"color":90,"percentage":91},"C++","#f34b7d",31.9,{"name":93,"color":94,"percentage":95},"Java","#b07219",1.4,{"name":97,"color":79,"percentage":98},"M4",1.2,{"name":100,"color":101,"percentage":102},"Shell","#89e051",1.1,{"name":104,"color":105,"percentage":106},"Go","#00ADD8",0.7,{"name":108,"color":109,"percentage":110},"Makefile","#427819",0.6,{"name":112,"color":113,"percentage":114},"Cuda","#3A4E3A",0.5,{"name":116,"color":117,"percentage":114},"Lua","#000080",{"name":119,"color":120,"percentage":121},"Python","#3572A5",0.2,1603,538,"2026-04-01T15:43:49","NOASSERTION",4,"Linux","非必需。支持 NVIDIA CUDA 和 AMD ROCm GPU 用于加速通信，具体型号和显存未说明。","未说明",{"notes":131,"python":132,"dependencies":133},"1. 主要面向高性能计算（HPC）环境，支持 InfiniBand, RoCE, Omni-Path, Cray Gemini\u002FAries 等高速网络。2. 已知问题：若使用 glibc 2.25-2.29 版本可能导致程序死锁，建议升级或降级 glibc。3. UCX 1.8.0 版本在 TCP 与共享内存混用时存在数据损坏 bug，建议使用 1.9.0 及以上版本。4. 
构建时可能需要额外的命令行标志以启用特定硬件支持。","未说明 (该工具为 C\u002FC++ 库，非 Python 包)",[134,135,136,137],"rdma-core >= 28.0 (针对 InfiniBand\u002FRoCE, UCX >= 1.12.0)","MLNX_OFED >= 5.0 (针对 InfiniBand\u002FRoCE, UCX >= 1.12.0)","glibc (需避免 2.25-2.29 版本以防死锁)","可选：KNEM, XPMEM (用于共享内存传输)",[53,55,14,26,15,13,54,51],[140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159],"rdma","hpc","networking","infiniband","cray","shared-memory","mpi","shmem","c","c-plus-plus","verbs","pgas","roce","iwarp","gemini","aries","drivers","openshmem","tcp-ip","hacktoberfest","2026-03-27T02:49:30.150509","2026-04-06T05:32:26.856748",[163,168,173,178,182,186],{"id":164,"question_zh":165,"answer_zh":166,"source_url":167},11803,"在容器化环境（如 Apptainer）中使用 OpenMPI 和 UCX 时，遇到 'process_vm_readv returned -1: Operation not permitted' 错误导致节点内通信失败，如何解决？","该问题通常是由于容器内的用户命名空间（user namespaces）限制或权限不足导致的。建议采取以下步骤：\n1. 检查系统管理员是否启用了用户命名空间。\n2. 尝试安装 `apptainer-suid` 版本，或在编译 Apptainer 时使用 `.\u002Fmconfig --with-suid` 选项。\n3. 如果是在 Rocky Linux 8.x 等内核版本较旧的系统上，可能会遇到用户命名空间泄漏问题（达到 `max_user_namespaces` 限制），此时需要清理未正确关闭的命名空间或升级内核。\n4. 建议在 Apptainer 仓库提交相关问题以获取针对容器环境的特定支持。","https:\u002F\u002Fgithub.com\u002Fopenucx\u002Fucx\u002Fissues\u002F8958",{"id":169,"question_zh":170,"answer_zh":171,"source_url":172},11804,"运行 Docker 容器时遇到 'UCX ERROR mlx5dv_devx_create_event_channel() failed: Protocol not supported' 错误，原因是什么？","此错误表明当前的硬件驱动或内核版本不支持 UCX 尝试使用的 DevX (Direct Verbs) 功能。这通常发生在较旧的 Mellanox OFED 驱动或内核上。解决方法包括：\n1. 升级 Mellanox OFED 驱动到支持 DevX 的版本。\n2. 
如果无法升级驱动，可以通过设置环境变量禁用 DevX 相关功能，例如：`export UCX_IB_DEVX=n` 或 `export UCX_TLS=^dc,rc` (具体取决于网络后端需求)，强制 UCX 回退到传统的 Verbs 接口。","https:\u002F\u002Fgithub.com\u002Fopenucx\u002Fucx\u002Fissues\u002F8440",{"id":174,"question_zh":175,"answer_zh":176,"source_url":177},11805,"使用 UCX 进行非阻塞广播（MPI_Ibcast）时，发现需要调用极多次数的 ucp_worker_progress 才能完成，导致性能下降，可能的原因是什么？","这可能是由于 UCX 将大消息分割成了多个短片段（short segments），并且每个片段的推进都需要调用 `ucp_worker_progress`。此外，如果使用了 Active Messages 协议且配置不当，可能会导致额外的开销。建议检查:\n1. 确认是否触发了特定的分段阈值（如 65536 字节）。\n2. 检查是否有连接建立（wireup）阶段的延迟或未完成的请求（如 `ucp_get_nb` 回调未触发）。\n3. 对比 PAMI 等其他后端的行为，确认是否为协议实现差异。若遇到挂起，需检查 trace 日志中是否有请求分配成功但回调从未执行的情况。","https:\u002F\u002Fgithub.com\u002Fopenucx\u002Fucx\u002Fissues\u002F4036",{"id":179,"question_zh":180,"answer_zh":181,"source_url":172},11806,"如何确认 OpenMPI 是否正确编译并链接了 UCX 支持？","可以使用以下命令检查 OpenMPI 的配置信息：\n`ompi_info -a | grep ucx`\n如果输出中没有显示 UCX 相关的组件或路径，说明 OpenMPI 可能未编译 UCX 支持，或者运行时无法找到 UCX 库。此外，如果在运行时指定 `--mca pml ucx` 报错提示组件未找到，也印证了这一点。需要重新编译 OpenMPI 并确保 `--with-ucx` 指向正确的安装路径。",{"id":183,"question_zh":184,"answer_zh":185,"source_url":177},11807,"在调试 UCX 问题时，如何强制立即刷新跟踪日志（trace），以便在程序挂起时看到最后的输出？","默认情况下，UCX 的跟踪日志可能会被缓冲，导致程序挂起时无法看到最近的日志。虽然 UCX 本身可能没有直接的环境变量来强制每行刷新，但在开发或调试构建中，可以在源码层面添加 `fflush` 调用。对于用户而言，可以尝试设置 `UCX_LOG_LEVEL` 为更高优先级以减少缓冲影响，或者在运行脚本中通过 `stdbuf -oL` 等工具强制标准输出行缓冲刷新。如果在源码中调试，需在写入 trace 的行之后显式调用 `fflush(stdout)` 或对应的文件流刷新函数。",{"id":187,"question_zh":188,"answer_zh":189,"source_url":172},11808,"OpenMPI 运行时报告 'No OpenFabrics connection schemes reported that they were able to be used' 且 CPCs (rdmacm, udcm) 尝试失败，该如何排查？","这表示 OpenFabrics (IB\u002FRoCE) 的连接管理组件无法在指定端口上建立连接。排查步骤如下：\n1. 检查本地设备（如 mlx5_0）和端口状态是否正常 (`ibstat`, `ibv_devinfo`)。\n2. 确认防火墙或安全组规则未阻挡 RDMA 通信端口。\n3. 检查是否加载了正确的内核模块（如 `rdma_cm`, `ib_ucm`）。\n4. 尝试手动指定传输层，例如设置 `UCX_TLS=rc,sm,cuda_copy,cuda_ipc` 排除有问题的传输。\n5. 
确保 OpenMPI 和 UCX 版本兼容，且 UCX 能正确识别底层硬件。",[191,196,201,206,211,216,221,226,231,236,241,246,251,256,261,266,271,276,281,286],{"id":192,"version":193,"summary_zh":194,"released_at":195},62267,"v1.20.1-rc1","## 1.20.1-rc1（2026年3月18日）\n### 功能特性：\n#### RDMA 核心（IB、ROCE 等）\n * 为 UCX_IB_MLX5_DEVX_OBJECTS 添加了 ‘auto’ 选项，当 ODP 可用时禁用 DevX（适用于 Grace 版本）\n * 优先使用子网掩码更长的路由，以提高可达性检查的准确性\n#### 文档\n * 澄清了在调用 ucp_atomic_op_nbx 后仍可修改用户缓冲区\n### 错误修复：\n#### UCP\n * 增加了传输选择中的 TLS 信息缓冲区大小，以防止潜在的截断问题\n * 修复了关于有效环境变量名称的错误警告\n * 修复了 ucp_config_modify 在找不到匹配的可修改配置时未报告错误的问题\n#### RDMA 核心（IB、ROCE 等）\n * 修复了 DevX 对象标志的处理逻辑\n * 修复了 MLX5 DevX 中设备内存分配的对齐问题\n * 修复了 IB 内存句柄标志枚举的顺序\n * 禁用了 Direct NIC 的间接原子注册\n * 修复了 UD 传输中连接重置前遗留的目标端点 ID 和确认消息问题\n * 修复了当 node_guuid 在多个 HCA 中不唯一时 RoCE 可达性路由检查的问题\n#### CUDA\n * 修复了 rkey 解包过程中对系统设备的 CUDA 上下文处理问题\n#### ROCM\n * 修复了较新 ROCm 版本中 HSA 内存类型检查的问题\n#### UCS\n * 修复了 GDR 复制的 rcache 加锁问题\n#### 打包\n * 修复了 ucx-cuda Debian 软件包依赖项中误移除 libnvidia-compute 的问题，该问题会导致现有安装损坏\n * 弃用了 KNEM 子软件包","2026-03-22T12:07:30",{"id":197,"version":198,"summary_zh":199,"released_at":200},62268,"v1.20.0-rc1","## 1.20.0-rc1\n### 功能特性：\n#### UCP\n * 新增 GPU 设备 API，用于实现 GPU 到 GPU 的直接通信\n * 新增主机端 API，用于 GPU 设备管理\n * 新增设备信号量 API，支持协作级别和标志位设置\n * 新增在设备操作中处理偏移量和通道 ID 的 API\n * 新增在设备操作中写入本地计数器的方法\n * 在设备 API 的内存列表元素中新增本地地址和远程地址字段\n * 新增设备通道选择及已分配句柄填充功能\n * 新增对使用 CUDA 的 Direct NIC（DPU）数据路径的支持\n * 新增对 Direct NIC 的 rkey 打包支持\n * 新增当内存的 sys_dev 与远程通道的 sys_dev 不同时的发送方刷新机制\n * 新增每种协议可使用单个网络设备的选项\n * 新增 MIN_RMA_CHUNK_SIZE 配置参数\n * 将 MIN_RMA_CHUNK_SIZE 的默认值由 16k 降低至 8k\n * 通过 find_lanes 回调改进协议通道选择，以最小化开销\n * 改进快速完成情况下的 send-zcopy 延迟因子\n * 改进多 ppn 性能估算\n * 移除已弃用的 ucp_mem 函数\n * 弃用 ucp_request_alloc API\n#### UCT\n * 新增用于 GPU 通信的新设备 API（rc_gda 传输）\n * 新增 GDAKI 传输，并支持将端点导出到 GPU\n * 新增对外部内存的 DEVX QP\u002FCQ 支持\n * 新增 CUDA_IPC 传输的设备 API 实现\n * 新增 CUDA_IPC 的设备批量写入、部分写入及原子操作\n * 新增 GDAKI 的对端故障错误处理能力\n * 新增在使用 GDA 传输时检查 nvidia_peermem 驱动的功能\n * 默认启用 IB 传输的 Direct NIC 功能\n * 新增 XDR 性能识别\n * 新增对 Direct NIC 通过 PCIe 映射 
DMA_BUF 句柄的支持\n * 通过快速路径缓存查找提升 GDR_COPY 性能\n#### RDMA CORE（IB、ROCE 等）\n * 新增 ConnectX-9 设备支持\n * 将 dp_ordering 标志位拆分为 DV 和 DevX 传输专用\n * 新增 VRF 表支持，用于 RoCE 可达性检查\n * 新增针对 EFA 特有的 GPUDirect 支持检测\n#### TCP\n * 新增在可达性验证过程中进行路由表检查的功能\n#### UCS\n * 引入轻量级读写锁数据结构\n * 为 rcache 读写锁添加内置原子操作\n * 改进 VFS 符号链接路径及重复对象的处理\n * 默认禁用错误信号拦截功能\n#### CUDA\n * 新增 NVML 函数的封装接口\n * 新增 cuLibraryGetGlobal 的钩子函数\n * 改进 CUDA 调用日志记录\n * 改进通道性能估算中的源\u002F目标内存类型检测\n * 移除不安全的 cuCtxGetId 使用方式\n * 新增对较新 CUDA 版本的 cuCtxCreate_v4 支持\n * 改进 CUDA_IPC 操作中的上下文管理\n#### UCM\n * 默认将模块信息打印调整为调试级别\n#### 工具\n * 在 perftest 中新增 GDAKI 内核选项\n * 在 perftest 中新增 UCP CUDA 设备测试\n * 新增 MPI+CUDA 示例\n * 在 perftest 中区分唤醒功能与额外信息选项\n#### 构建\n * 新增为受支持的平台构建 CUDA 设备代码的能力","2026-02-05T14:25:25",{"id":202,"version":203,"summary_zh":204,"released_at":205},62269,"v1.20.0","## 1.20.0\n### 功能特性：\n#### UCP\n * 新增 GPU 设备 API，用于实现 GPU 间直接通信\n * 新增主机端 API，用于 GPU 设备管理\n * 新增设备信号量 API，支持协作级别和标志位设置\n * 新增在设备操作中处理偏移量和通道 ID 的 API\n * 新增在设备操作中写入本地计数器的方法\n * 在设备 API 的内存列表元素中新增本地地址和远程地址字段\n * 新增设备通道选择及分配句柄填充功能\n * 新增对使用 CUDA 的 Direct NIC（DPU）数据路径的支持\n * 新增对 Direct NIC 的 rkey 打包支持\n * 新增当内存 sys_dev 与远程通道 sys_dev 不同时的发送方刷新机制\n * 新增每种协议可使用单个网络设备的选项\n * 新增 MIN_RMA_CHUNK_SIZE 配置参数\n * 将 MIN_RMA_CHUNK_SIZE 的默认值由 16k 降低至 8k\n * 通过 find_lanes 回调改进协议通道选择，以最小化开销\n * 改进快速完成情况下的 send-zcopy 延迟因子\n * 改进多 ppn 性能估算\n * 移除已弃用的 ucp_mem 函数\n * 弃用 ucp_request_alloc API\n#### UCT\n * 新增用于 GPU 通信的新设备 API（rc_gda 传输）\n * 新增 GDAKI 传输，并支持将端点导出至 GPU\n * 新增对外部内存的 DEVX QP\u002FCQ 支持\n * 新增 CUDA_IPC 传输的设备 API 实现\n * 新增 CUDA_IPC 的设备批量写入、部分写入及原子操作\n * 新增 GDAKI 的对端故障错误处理能力\n * 新增在使用 GDA 传输时检查 nvidia_peermem 驱动的功能\n * 默认启用 IB 传输的 Direct NIC 功能\n * 新增 XDR 性能识别\n * 新增对 Direct NIC 通过 PCIe 映射 DMA_BUF 句柄的支持\n * 通过快速路径缓存查找提升 GDR_COPY 性能\n#### RDMA CORE（IB、ROCE 等）\n * 新增 ConnectX-9 设备支持\n * 将 dp_ordering 标志位拆分为 DV 和 DevX 传输专用\n * 新增 VRF 表支持，用于 RoCE 可达性检查\n * 新增针对 EFA 特有的 GPUDirect 支持检测\n#### TCP\n * 新增在可达性验证过程中进行路由表检查的功能\n#### UCS\n * 引入轻量级读写锁数据结构\n * 为 rcache 读写锁添加内置原子操作\n * 改进 VFS 
符号链接路径及重复对象的处理\n * 默认禁用错误信号拦截功能\n#### CUDA\n * 新增 NVML 函数的封装接口\n * 新增 cuLibraryGetGlobal 的钩子函数\n * 改进 CUDA 调用日志记录\n * 改进通道性能估算中的源\u002F目标内存类型检测\n * 移除不安全的 cuCtxGetId 使用方式\n * 新增对较新 CUDA 版本的 cuCtxCreate_v4 支持\n * 改进 CUDA_IPC 操作中的上下文管理\n#### UCM\n * 默认将模块信息打印级别调整为调试级别\n#### 工具\n * 在 perftest 中新增 GDAKI 内核选项\n * 在 perftest 中新增 UCP CUDA 设备测试\n * 新增 MPI+CUDA 示例\n * 在 perftest 中区分唤醒功能与额外信息选项\n#### 构建\n * 新增为受支持架构构建 CUDA 设备代码的能力","2026-02-05T13:44:25",{"id":207,"version":208,"summary_zh":209,"released_at":210},62270,"v1.19.1","## 1.19.1（2025年9月18日）\n### 功能：\n#### UCP\n* 如果不使用会合协议，则无需传输内存支持\n#### 构建\n* 在发布流水线中添加了对 CUDA 13 的支持\n* 在发布流水线中添加了对 Rocky OS 的支持\n### 修复：\n#### UCS\n* 修复了 Netlink 获取机制\n","2025-12-07T15:51:08",{"id":212,"version":213,"summary_zh":214,"released_at":215},62271,"v1.19.1-rc2","## 1.19.1（2025年10月21日）\n### 功能：\n#### UCP\n* 如果不使用会合协议，则无需支持传输内存\n#### 构建\n* 在发布流水线中添加了对 CUDA 13 的支持\n* 在发布流水线中添加了对 Rocky OS 的支持\n### 修复：\n#### UCS\n* 修复了 Netlink 获取机制","2025-10-21T14:42:11",{"id":217,"version":218,"summary_zh":219,"released_at":220},62272,"v1.19.1-rc1","## 1.19.1（2025年9月18日）\n### 功能：\n#### UCP\n* 如果不使用会合协议，则无需传输内存支持\n#### 构建\n* 在发布流水线中添加了对 CUDA 13 的支持","2025-09-21T13:36:40",{"id":222,"version":223,"summary_zh":224,"released_at":225},62273,"v1.19.0","## 1.19.0（2025年8月6日）\n### 功能特性：\n#### UCP\n* 在单个进程中启用多GPU支持\n* 在RMA刷新操作中增加了强栅栏与弱栅栏之间的动态选择\n* 改进了端点重新配置能力\n* 为多网卡GPU系统添加了All2All通道选择\n* 当配置缓存限制达到时，改进了rkey调试信息\n* 根据可用内存类型改进了UCP协议选择\n* 从无关传输层（TCP、CMA和CUDA）中移除了虚拟内存密钥\n* 使用设备本地暂存缓冲区提升了RNDV性能\n* 启用了RMA get_offload协议的错误处理\n#### UCT\n* 定义了uct_rkey_unpack_v2 API，以支持传递系统设备\n#### RDMA CORE（IB、ROCE等）\n* 在EFA中增加了SRD传输支持，包括重排序、AM和控制操作\n* 移除了XGVMI BF2支持（umem）\n* 移除了设备内存间接密钥\n* 修复了DCI和池的VFS对象\n* 在可达性检查中添加了路由表缓存\n* 修复了IB辅助rkey中的严格顺序使用问题\n* 改进了各类初始化日志消息\n#### CUDA\n* 为CUDA IPC的远程密钥解包增加了多上下文支持\n* 为CUDA IPC增加了上下文切换感知的资源管理\n* 使用缓冲区ID检测CUDA IPC中的VA回收\n* 增加了在特定系统设备上分配CUDA内存的支持\n* 在CUDA复制中增加了多设备支持\n* 改进了GPU内存操作的协议通道选择\n* 放宽了CUDA复制中对CUDA上下文的要求\n* 增加了CUDA复制中的死锁预防机制\n* 增加了对VMM地址范围检测的支持\n* 
在切换CUDA GPU后启用了内存属性查询\n* 为CUDA传输增加了多GPU发送测试\n* 从CUDA复制传输中移除了主机间性能估算\n* 将cuCtxCreate替换为cuDevicePrimaryCtxRetain\n* 改进了各类初始化日志消息\n#### ROCM\n* 增加了IPC句柄缓存和信号池大小的控制参数\n* 通过缓存优化了ROCm内存类型的检测\n#### UCS\n* 移除了编译警告\n#### 工具\n* 为ucx_info增加了名称过滤选项（-F 'str'），用于配置和特性转储\n* 改进了ucx_info的输入验证\n### 错误修复：\n#### UCP\n* 使UCX_TLS=^ib能够禁用所有传输层，包括辅助传输\n* 修复了发送请求状态的处理问题\n* 通过优化md缓存更新，修复了RNDV中的性能下降问题\n* 修复了当第一条通道因分片大小被过滤掉时的协议选择问题\n* 通过使用内存注册标志修复了rkey选择问题\n#### UCT\n#### RDMA CORE（IB、ROCE等）\n* 通过增加DCI验证并分离连接逻辑，提高了DC传输的可靠性\n* 修复了DC栅栏操作中的段错误\n#### GPU（CUDA、ROCM）\n* 更新了ROCm配置，以兼容ROCm 6.3版本\n* 修复了CUDA异步内存操作中的系统设备检测问题\n* 修复了CUDA IPC mpack过程中旧类型检测的问题\n* 通过为本地缓冲区使用正确的上下文，修复了CUDA IPC RMA操作中的问题\n#### UCS\n* 使用UCS函数进行前导计数","2025-08-06T12:23:09",{"id":227,"version":228,"summary_zh":229,"released_at":230},62274,"v1.19.0-rc2","## 1.19.0（2025年6月18日）\n### 功能特性：\n#### UCP\n* 在单个进程中启用多GPU支持\n* 在RMA刷新操作中增加了强栅栏与弱栅栏之间的动态选择\n* 改进了端点重新配置能力\n* 为多网卡GPU系统添加了All2All通道选择\n* 在配置缓存限制达到时，改进了rkey调试信息\n* 根据可用内存类型改进了UCP协议选择\n* 从无关传输层（TCP、CMA和CUDA）中移除了虚拟内存密钥\n* 使用设备本地暂存缓冲区提升了RNDV性能\n* 启用了RMA get_offload协议的错误处理\n#### UCT\n* 定义了uct_rkey_unpack_v2 API，以支持传递sys-dev\n#### RDMA CORE（IB、ROCE等）\n* 在EFA中增加了SRD传输支持，包括重新排序、AM和控制操作\n* 移除了XGVMI BF2支持（umem）\n* 移除了设备内存间接密钥\n* 修复了DCI和池的VFS对象\n* 在可达性检查中添加了路由表缓存\n* 修复了IB辅助rkey中的严格顺序使用问题\n* 改进了各类初始化日志消息\n#### CUDA\n* 为CUDA IPC的远程密钥解包增加了多上下文支持\n* 为CUDA IPC增加了上下文切换感知的资源管理\n* 使用缓冲区ID检测CUDA IPC中的VA回收\n* 增加了在特定系统设备上分配CUDA内存的支持\n* 在CUDA复制中增加了多设备支持\n* 改进了GPU内存操作的协议通道选择\n* 放宽了CUDA复制中对CUDA上下文的要求\n* 在CUDA复制中增加了死锁预防机制\n* 增加了对VMM地址范围检测的支持\n* 在切换CUDA GPU后启用了内存属性查询\n* 为CUDA传输增加了多GPU发送测试\n* 从CUDA复制传输中移除了主机间性能估算\n* 将cuCtxCreate替换为cuDevicePrimaryCtxRetain\n* 改进了各类初始化日志消息\n#### ROCM\n* 增加了IPC句柄缓存和信号池大小的控制参数\n* 通过缓存优化了ROCm内存类型检测\n#### UCS\n* 移除了编译警告\n#### 工具\n* 为ucx_info增加了名称过滤选项（-F 'str'），用于配置和特性转储\n* 改进了ucx_info的输入验证\n### 错误修复：\n#### UCP\n* 使UCX_TLS=^ib能够禁用所有传输层，包括辅助传输\n* 修复了发送请求状态的处理问题\n* 通过优化md缓存更新，修复了RNDV中的性能下降问题\n* 修复了当第一通道因分片大小被过滤时的协议选择问题\n* 通过使用内存注册标志修复了rkey选择问题\n#### UCT\n#### RDMA CORE（IB、ROCE等）\n* 
通过增加DCI验证并分离连接逻辑，提高了DC传输的可靠性\n* 修复了DC栅栏操作中的段错误\n#### GPU（CUDA、ROCM）\n* 更新了ROCm配置，以兼容ROCm 6.3\n* 修复了CUDA异步内存操作中的系统设备检测问题\n* 修复了CUDA IPC mpack过程中对旧类型检测的错误\n* 通过为本地缓冲区使用正确的上下文，修复了CUDA IPC RMA操作的问题\n#### UCS\n* 使用UCS函数来计数leadi","2025-07-22T08:17:14",{"id":232,"version":233,"summary_zh":234,"released_at":235},62275,"v1.19.0-rc1","## 1.19.0（2025年6月18日）\n### 功能特性：\n#### UCP\n* 在单个进程中启用多GPU支持\n* 在RMA刷新操作中增加了强栅栏与弱栅栏之间的动态选择\n* 改进了端点重新配置能力\n* 为多网卡GPU系统添加了All2All通道选择\n* 当配置缓存限制达到时，改进了rkey调试信息\n* 根据可用的内存类型改进了UCP协议选择\n* 从无关传输层（TCP、CMA和CUDA）中移除了虚拟内存密钥\n* 使用设备本地暂存缓冲区提升了RNDV性能\n* 启用了RMA get_offload协议的错误处理\n#### UCT\n* 定义了uct_rkey_unpack_v2 API，以支持传递sys-dev\n#### RDMA CORE（IB、ROCE等）\n* 在EFA中增加了SRD传输支持，包括重新排序、AM和控制操作\n* 移除了XGVMI BF2支持（umem）\n* 移除了设备内存间接密钥\n* 修复了DCI和池的VFS对象\n* 在可达性检查中添加了路由表缓存\n* 修复了IB辅助rkey中的严格顺序使用问题\n* 改进了各类初始化日志消息\n#### CUDA\n* 为CUDA IPC的远程密钥解包增加了多上下文支持\n* 为CUDA IPC增加了上下文切换感知的资源管理\n* 使用缓冲区ID检测CUDA IPC中的VA回收\n* 增加了在特定系统设备上分配CUDA内存的支持\n* 在CUDA复制中增加了多设备支持\n* 改进了针对GPU内存操作的协议通道选择\n* 放宽了CUDA复制中对CUDA上下文的要求\n* 在CUDA复制中增加了死锁预防机制\n* 增加了对VMM地址范围检测的支持\n* 在切换CUDA GPU后启用了内存属性查询\n* 为CUDA传输增加了多GPU发送测试\n* 从CUDA复制传输中移除了主机间性能估算\n* 将cuCtxCreate替换为cuDevicePrimaryCtxRetain\n* 改进了各类初始化日志消息\n#### ROCM\n* 增加了IPC句柄缓存和信号池大小的控制参数\n* 通过缓存优化了ROCm内存类型检测\n#### UCS\n* 移除了编译警告\n#### 工具\n* 为ucx_info增加了名称过滤选项（-F 'str'），用于配置和特性转储\n* 改进了ucx_info的输入验证\n### 错误修复：\n#### UCP\n* 使UCX_TLS=^ib能够禁用所有传输层，包括辅助传输\n* 修复了发送请求状态处理问题\n* 通过优化md缓存更新，修复了RNDV中的性能下降问题\n* 修复了当第一通道因分片大小被过滤掉时的协议选择问题\n* 通过使用内存注册标志修复了rkey选择问题\n#### UCT\n#### RDMA CORE（IB、ROCE等）\n* 通过增加DCI验证并分离连接逻辑，提高了DC传输的可靠性\n* 修复了DC栅栏操作中的段错误\n#### GPU（CUDA、ROCM）\n* 更新了ROCm配置，以兼容ROCm 6.3版本\n* 修复了CUDA异步内存操作中的系统设备检测问题\n* 修复了CUDA IPC mpack过程中旧类型检测的问题\n* 通过为本地缓冲区使用正确的上下文，修复了CUDA IPC RMA操作问题\n#### UCS\n* 使用UCS函数来计数leadi","2025-06-24T12:22:39",{"id":237,"version":238,"summary_zh":239,"released_at":240},62276,"v1.18.1","## 1.18.1（2025年4月28日）\n### 功能：\n#### CUDA\n* 添加了用于更新一致性平台下 cuda_copy 带宽的配置键\n* 改进了使用 CUDA 内存池分配的内存的缓存失效机制\n#### AZP\n* 将 Ubuntu 24.04 
添加到构建和发布流水线中\n### 修复：\n#### UCP\n* 修复了当最大车道片段小于 AM 头部时出现的断言失败问题\n* 修复了在协议重新配置时可能出现的主动消息用户头部释放后使用问题\n#### CUDA\n* 修复了由 UCT 分配的 CUDA Fabric 内存的注册问题\n* 修复了使用 VMM 和 CUDA 内存池分配的内存的 VA 回收检查问题\n#### RDMA CORE（IB、ROCE 等）\n* 不再使用 ConnectX-8 SMI 子设备进行通信\n* 通过在设备支持 DDP 时禁用 ODP，修复了远程访问错误\n* 通过在 AR 被禁用时禁用 DDP，修复了配置逻辑\n#### UCM\n* 修复了在 amd64 架构上使用 CUDA 12.9 的 bistro 钩子时发生的崩溃","2025-04-28T16:20:26",{"id":242,"version":243,"summary_zh":244,"released_at":245},62277,"v1.18.1-rc3","## 1.18.1-rc3 (April 17, 2025)\r\n### Bugfixes:\r\n#### UCM\r\n* Fixed crash with bistro hooks for CUDA 12.9 on amd64","2025-04-17T17:02:06",{"id":247,"version":248,"summary_zh":249,"released_at":250},62278,"v1.18.1-rc2","## 1.18.1-rc2 (April 9, 2025)\r\n### Features:\r\n#### CUDA\r\n * Added config keys to update cuda_copy bandwidth for coherent platforms\r\n * Improved cache invalidation of memory allocated using CUDA memory pool\r\n### Bugfixes:\r\n#### UCP\r\n * Fixed assertion failure when maximum lane fragment is smaller than AM header\r\n#### CUDA\r\n * Fixed registration of CUDA Fabric memory allocated by UCT\r\n * Fixed VA recycling check of memory allocated using VMM and CUDA memory pool\r\n#### RDMA CORE (IB, ROCE, etc.)\r\n * Do not use ConnectX-8 SMI subdevices for communication\r\n * Fixed remote access error by disabling ODP when the device supports DDP\r\n * Fixed configuration logic by disabling DDP when AR is disabled\r\n","2025-04-09T16:12:22",{"id":252,"version":253,"summary_zh":254,"released_at":255},62279,"v1.18.1-rc1","## 1.18.1-rc1 (February 20, 2025)\r\n### Features:\r\n#### AZP\r\n * Added Ubuntu 24.04 to build and release pipeline\r\n### Bugfixes:\r\n#### UCP\r\n * Fixed potential active message user header use after free with protocol reconfiguration\r\n","2025-02-21T22:58:14",{"id":257,"version":258,"summary_zh":259,"released_at":260},62280,"v1.18.0","## 1.18.0 (January 17, 2025)\r\n### Features:\r\n#### UCP\r\n * Enabled using CUDA staging buffers for pipeline protocols by 
default\r\n * Added endpoint reconfiguration support for non-reused p2p scenarios\r\n * Enabled non-cacheable memory domains, activated for gdr_copy\r\n * Added user_data parameter to ucp_ep_query\r\n * Added support for host memory pipeline through CUDA buffers for rendezvous protocol\r\n * Added global VA infrastructure and memory region in absence of error handling\r\n * Made protocol performance node names more informative\r\n * Enforced always running on the same thread in single thread mode\r\n * Multiple improvements in protocols selection infrastructure\r\n * Added UCP_MEM_MAP_LOCK API flag to enforce locked memory mapping\r\n * Allowed up-to 64 endpoint lanes for systems with many transports or devices\r\n * Added usage tracker to worker\r\n * Improved various logging messages\r\n#### RDMA CORE (IB, ROCE, etc.)\r\n * Added environment variable to manage DC initiator capacity\r\n * Added DC dcs_hybrid policy\r\n * Reduced MLX5\u002FDV stack size consumption\r\n * Added ODP support for verbs and mlx5dv\r\n * Added support of CUDA managed memory on IB when ODP is available\r\n * Added support of Adaptive Routing on RoCE\r\n * Enabled use of implicit ODP with relaxed ordering\r\n * Improved GPU-Direct detection in IB transport\r\n * Increased DC initiator default count to 32 for performance optimization\r\n * Added ConnectX-8 device support with DDP\r\n * Added support for subnet filter list for RoCE interfaces\r\n * Enhanced the error message to provide more details when a connection cannot be established due to unreachable transports\r\n * Added IB MLX5 as a separate UCX module with separate RPM sub-package\r\n * Added initial support for GGA transport, for fast DPU memory access\r\n * Set IB DevX atomic mode based on device capabilities\r\n * Removed DC keepalive mechanism, since the keepalive is done on UCP layer\r\n * Optimized cross-gVMI memory registration using indirect memory keys cache\r\n * Improved various logging messages\r\n#### CUDA\r\n * Added 
multi-node NVlink support\r\n * Added CUDA Fabric memory support with detection and allocation\r\n * Improved gdr_copy latency estimations on AMD Milan systems\r\n * Added check for gdr_copy runtime\u002Fbuild version mismatch\r\n * Added handling missing IPC capability when unpacking keys\r\n * Added caching for CUDA IPC memory pool import operation\r\n * Added gdr_copy variables to optimize performance on Grace Hopper systems\r\n * Improved CUDA IPC concurrency for a larger count of reachable peers\r\n#### UCS\r\n * Added support for wildcards in configuration parameter names\r\n * Added ASAN protection to several internal data structures\r\n * Reduced stack usage in topology detection code\r\n * Improved bitmaps configuration parsing with wider bitfield\r\n * Added options to set topology distance between devices\r\n * Optimized VFS unix socket watch by using user private folder\r\n * Added general IP subnet matching infrastructure\r\n * Extend array data structure to support user-provided array copy routine\r\n * Improved time units description\r\n#### UCM\r\n * Extend CUDA memory hooks to include memory mapping APIs\r\n#### Tools\r\n * Improved performance by increasing window size for put_bw and add get_bw in ucx_perftest\r\n * Added multi-send flag for receive operations in bandwidth benchmarks in ucx_perftest\r\n * Improved ucx_perftest uni-directional test with added fence\r\n * Detailed ucx_perftest batch section of command-line documentation\r\n#### Documentation\r\n * Added a section regarding adaptive routing on RoCE\r\n#### Architecture\r\n * Added CPU Model for MI300A\r\n * Added Fujitsu ARM specific values to ucx.conf\r\n * Added AMD Turin support\r\n * Added an optimized non-temporal memory copy implementation for AMD CPU\r\n#### Build\r\n * Improved compiler error reporting with added flag\r\n * Improved coverity script to allow faster turnaround time\r\n * Improved Intel Compiler detection and support\r\n#### GO\r\n * Added multi-send flag and 
user memh support in request params\r\n#### Packaging\r\n * Improved dpkg-buildpackage sample command by explicitly adding mlx5 related arguments\r\n### Bugfixes:\r\n#### UCP\r\n * Fixed stack overflow in exported rkey unpack\r\n * Removed extra remote-cpu overhead from protocol estimation for zcopy\r\n * Fixed performance estimation for rndv pipeline protocols\r\n * Fixed ATP sending by picking the correct lane\r\n * Fixed missing reg_id on memh creation\r\n * Fixed repeated invalidations by retaining existing access flags\r\n * Fixed abort reason propagation for rendezvous RTR mtype\r\n * Do not check transport availability if it is disabled by UCX_TLS environment variable\r\n * Fixed wrong flag being used for checking BCOPY capability\r\n * Fixed sending too many ATPs for small messages\r\n * Enforced 16 bits size for Active Messages identifiers\r\n * Fixed unnecessary status check for emulated AMO\r\n * Fixed more than one fragment sending in rendezvous pipeline\r\n * Fixed crash by using biggest max frag across all lanes\r\n * Fixed missing memory hand","2025-01-21T09:39:52",{"id":262,"version":263,"summary_zh":264,"released_at":265},62281,"v1.18.0-rc3","## 1.18.0-rc3 (December 23, 2024)\r\n### Features:\r\n#### UCP\r\n * Enabled using CUDA staging buffers for pipeline protocols by default\r\n * Added endpoint reconfiguration support for non-reused p2p scenarios\r\n * Enabled non-cacheable memory domains, activated for gdr_copy\r\n * Added user_data parameter to ucp_ep_query\r\n * Added support for host memory pipeline through CUDA buffers for rendezvous protocol\r\n * Added global VA infrastructure and memory region in absence of error handling\r\n * Made protocol performance node names more informative\r\n * Enforced always running on the same thread in single thread mode\r\n * Multiple improvements in protocols selection infrastructure\r\n * Added UCP_MEM_MAP_LOCK API flag to enforce locked memory mapping\r\n * Allowed up-to 64 endpoint lanes for systems 
with many transports or devices\r\n * Added usage tracker to worker\r\n * Improved various logging messages\r\n#### RDMA CORE (IB, ROCE, etc.)\r\n * Added environment variable to manage DC initiator capacity\r\n * Added DC dcs_hybrid policy\r\n * Reduced MLX5\u002FDV stack size consumption\r\n * Added ODP support for verbs and mlx5dv\r\n * Added support of CUDA managed memory on IB when ODP is available\r\n * Added support of Adaptive Routing on RoCE\r\n * Enabled use of implicit ODP with relaxed ordering\r\n * Improved GPU-Direct detection in IB transport\r\n * Increased DC initiator default count to 32 for performance optimization\r\n * Added ConnectX-8 device support with DDP\r\n * Added support for subnet filter list for RoCE interfaces\r\n * Enhanced the error message to provide more details when a connection cannot be established due to unreachable transports\r\n * Added IB MLX5 as a separate UCX module with separate RPM sub-package\r\n * Added initial support for GGA transport, for fast DPU memory access\r\n * Set IB DevX atomic mode based on device capabilities\r\n * Removed DC keepalive mechanism, since the keepalive is done on UCP layer\r\n * Optimized cross-gVMI memory registration using indirect memory keys cache\r\n * Improved various logging messages\r\n#### CUDA\r\n * Added multi-node NVlink support\r\n * Added CUDA Fabric memory support with detection and allocation\r\n * Improved gdr_copy latency estimations on AMD Milan systems\r\n * Added check for gdr_copy runtime\u002Fbuild version mismatch\r\n * Added handling missing IPC capability when unpacking keys\r\n * Added caching for CUDA IPC memory pool import operation\r\n * Added gdr_copy variables to optimize performance on Grace Hopper systems\r\n * Improved CUDA IPC concurrency for a larger count of reachable peers\r\n#### UCS\r\n * Added support for wildcards in configuration parameter names\r\n * Added ASAN protection to several internal data structures\r\n * Reduced stack usage in topology 
detection code\r\n * Improved bitmaps configuration parsing with wider bitfield\r\n * Added options to set topology distance between devices\r\n * Optimized VFS unix socket watch by using user private folder\r\n * Added general IP subnet matching infrastructure\r\n * Extend array data structure to support user-provided array copy routine\r\n * Improved time units description\r\n#### UCM\r\n * Extend CUDA memory hooks to include memory mapping APIs\r\n#### Tools\r\n * Improved performance by increasing window size for put_bw and add get_bw in ucx_perftest\r\n * Added multi-send flag for receive operations in bandwidth benchmarks in ucx_perftest\r\n * Improved ucx_perftest uni-directional test with added fence\r\n * Detailed ucx_perftest batch section of command-line documentation\r\n#### Documentation\r\n * Added a section regarding adaptive routing on RoCE\r\n#### Architecture\r\n * Added CPU Model for MI300A\r\n * Added Fujitsu ARM specific values to ucx.conf\r\n * Added AMD Turin support\r\n * Added an optimized non-temporal memory copy implementation for AMD CPU\r\n#### Build\r\n * Improved compiler error reporting with added flag\r\n * Improved coverity script to allow faster turnaround time\r\n * Improved Intel Compiler detection and support\r\n#### GO\r\n * Added multi-send flag and user memh support in request params\r\n#### Packaging\r\n * Improved dpkg-buildpackage sample command by explicitly adding mlx5 related arguments\r\n### Bugfixes:\r\n#### UCP\r\n * Fixed stack overflow in exported rkey unpack\r\n * Removed extra remote-cpu overhead from protocol estimation for zcopy\r\n * Fixed performance estimation for rndv pipeline protocols\r\n * Fixed ATP sending by picking the correct lane\r\n * Fixed missing reg_id on memh creation\r\n * Fixed repeated invalidations by retaining existing access flags\r\n * Fixed abort reason propagation for rendezvous RTR mtype\r\n * Do not check transport availability if it is disabled by UCX_TLS environment variable\r\n * 
Fixed wrong flag being used for checking BCOPY capability\r\n * Fixed sending too many ATPs for small messages\r\n * Enforced 16 bits size for Active Messages identifiers\r\n * Fixed unnecessary status check for emulated AMO\r\n * Fixed more than one fragment sending in rendezvous pipeline\r\n * Fixed crash by using biggest max frag across all lanes\r\n * Fixed missing memory","2024-12-23T17:06:36",{"id":267,"version":268,"summary_zh":269,"released_at":270},62282,"v1.18.0-rc2","## 1.18.0-rc2 (December 10, 2024)\r\n### Features: TBD\r\n### Bugfixes: TBD","2024-12-10T16:40:00",{"id":272,"version":273,"summary_zh":274,"released_at":275},62283,"v1.18.0-rc1","## 1.18.0-rc1 (November 26, 2024)\r\n### Features: TBD\r\n### Bugfixes: TBD","2024-11-26T13:25:40",{"id":277,"version":278,"summary_zh":279,"released_at":280},62284,"v1.17.0","## 1.17.0 (June 13, 2024)\r\n### Features:\r\n#### UCP\r\n* Improved the accuracy of rendezvous protocol performance estimation\r\n* Enabled short protocol for non-host memory types on empty messages\r\n* Improved the accuracy of performance estimation for empty messages by removing non-relevant overheads\r\n* Added RMA_ZCOPY_MAX_SEG_SIZE configuration parameter to allow modifying segment size for RMA-ZCOPY protocols\r\n* Added support for separate intra\u002Finter-node rendezvous thresholds\r\n* Added support for minimal fragment size in rendezvous protocol\r\n* Added support for resetting request during send operation\r\n* Added UCX_PROTO_OVERHEAD configuration variable to allow setting protocol overheads\r\n* Improved performance for combined Active Message\u002FRMA scenarios by separating them to different lanes\r\n* Added support for device staging buffers in pipeline protocols\r\n* Enabled on-demand paging for Nvidia's Grace platforms by default \r\n#### RDMA CORE (IB, ROCE, etc.)\r\n* Introduced the UCX_REVERSE_SL environment variable to configure reverse SL for DC transport. 
By default, it uses UCX_IB_SL.\r\n* Added support for GID auto-detection in Floating LID based routing\r\n* Added support for multithreading KSM registration of unaligned buffers\r\n* Added IB_SEND_OVERHEAD and MM_[SEND|RECV]_OVERHEAD configuration variables\r\n#### GPU (CUDA, ROCM)\r\n* Added support for oneAPI Level-Zero library for Intel GPUs\r\n#### UCS\r\n* Added support for rcache dynamic region alignment\r\n* Added dynamic bitmap data structure\r\n* Added support for advanced key-value parsing for UCX configuration\r\n* Added piecewise linear function data structure\r\n* Added support for allocating dynamic arrays on stack \r\n#### Tools\r\n* Added support for device memory allocation in UCX perftest\r\n* Added a script to use for squashing commits after PR approval\r\n* Added support for DPU cross-gvmi daemon in UCX perftest\r\n#### Java\r\n* Added support for EP local socket address API in JUCX\r\n#### Build\r\n* Added address sanitizer support\r\n* Added a helper shell script to run static checks\r\n#### AZP\r\n* Replaced Valgrind tests with address sanitizer tool\r\n* Added Ubuntu 22.04 docker image testing\r\n#### Configuration\r\n* Added support for filtering configuration sections by platform type\r\n* Added configuration file with section for Grace Hopper\r\n### Bugfixes:\r\n#### UCP\r\n* Fixed crash due to incorrect lane selection when active message is disabled\r\n* Fixed RMA lane selection issue due to wrong bandwidth calculation \r\n* Fixed rendezvous protocol information in protocol details table\r\n* Fixed endpoint reconfiguration issue due to wrong bandwidth calculation\r\n* Fixed Active Message handlers issue due to out of order registration\r\n* Fixed registration of memh events for imported memory key\r\n* Fixed sockaddr unreachable destination error handling\r\n* Fixed uninitialized memory issue in new protocols infrastructure\r\n* Fixed race condition when using strong fence by flushing all endpoints\r\n* Fixed incorrect RMA message size 
on immediate completion with no datatype\r\n* Fixed incorrect performance estimation due to fp8 pack\u002Funpack issue\r\n* Fixed remote access error when rcache memory is not registered with atomic access\r\n* Fixed assertion failure when rcache fails during memh allocation\r\n* Fixed atomic device selection issue\r\n* Fixed worker interface deactivation while still in use by endpoints\r\n* Fixed wire compatibility issue due to mismatched lane selection\r\n#### RDMA CORE (IB, ROCE, etc.)\r\n* Disabled device memory if atomics are not available\r\n* Fixed indirect keys creation for MT registered memory\r\n* Fixed KSM start address value when creating export key \r\n* Fixed DCI pool index to support maximum of 16 pools\r\n* Fixed atomic rkey issue when using imported memory\r\n* Fixed crash due to unsupported SRQ capability\r\n#### GPU (CUDA, ROCM)\r\n* Removed unused environment variable RCACHE_ADDR_ALIGN from ROCm transport\r\n* Fixed usage of cuda device 0 when no context is active\r\n* Removed error handling support from CUDA IPC transport\r\n* Fixed allocation of unaligned CUDA memory\r\n#### Shared Memory\r\n* Fixed occasional crash when shm_unlink fails during interface initialization\r\n#### UCS\r\n* Fixed system device distance calculation for devices on different PCIe root\r\n* Fixed support for large size arrays in ucs_array\r\n* Fixed synchronization issue in rcache\r\n* Fixed uninitialized variable access in rcache\r\n#### Tests\r\n* Fixed test failures when GPU is present but disabled\r\n* Fixed Active Message hanging issue in ucp_client_server\r\n* Fixed potential crash due to redundant munmap call in ucp mmap tests\r\n* Fixed a crash when running CUDA gtest under valgrind\r\n* Fixed UD endpoint timeout issue under Valgrind\r\n#### Java\r\n* Fixed failures in Java tests by waiting for send requests completion\r\n* Fixed JVM segfault in Java tests when gdrcopy driver is not loaded\r\n* Fixed go build and go tests failures\r\n#### Packaging\r\n* 
Disabled Go bindings in Debian package","2024-06-13T15:35:02",{"id":282,"version":283,"summary_zh":284,"released_at":285},62285,"v1.17.0-rc3","## 1.17.0 RC3 (June 6, 2024)\r\n### Bugfixes:\r\n#### UCP\r\n* Fixed wire compatibility issue due to mismatched lane selection\r\n#### UCS\r\n* Fixed uninitialized variable access in rcache","2024-06-06T16:47:20",{"id":287,"version":288,"summary_zh":289,"released_at":290},62286,"v1.17.0-rc2","## 1.17.0 RC2 (May 29, 2024)\r\n### Features:\r\n#### UCP\r\n* Improved the accuracy of rendezvous protocol performance estimation\r\n* Enabled short protocol for non-host memory types on empty messages\r\n* Improved the accuracy of performance estimation for empty messages by removing non-relevant overheads\r\n* Added RMA_ZCOPY_MAX_SEG_SIZE configuration parameter to allow modifying segment size for RMA-ZCOPY protocols\r\n* Added support for separate intra\u002Finter-node rendezvous thresholds\r\n* Added support for minimal fragment size in rendezvous protocol\r\n* Added support for resetting request during send operation\r\n* Added UCX_PROTO_OVERHEAD configuration variable to allow setting protocol overheads\r\n* Improved performance for combined Active Message\u002FRMA scenarios by separating them to different lanes\r\n* Added support for device staging buffers in pipeline protocols\r\n* Enabled on-demand paging for Nvidia's Grace platforms by default \r\n#### RDMA CORE (IB, ROCE, etc.)\r\n* Introduced the UCX_REVERSE_SL environment variable to configure reverse SL for DC transport. 
By default, it uses UCX_IB_SL.\r\n* Added support for GID auto-detection in Floating LID based routing\r\n* Added support for multithreading KSM registration of unaligned buffers\r\n* Added IB_SEND_OVERHEAD and MM_[SEND|RECV]_OVERHEAD configuration variables\r\n#### GPU (CUDA, ROCM)\r\n* Added support for oneAPI Level-Zero library for Intel GPUs\r\n#### UCS\r\n* Added support for rcache dynamic region alignment\r\n* Added dynamic bitmap data structure\r\n* Added support for advanced key-value parsing for UCX configuration\r\n* Added piecewise linear function data structure\r\n* Added support for allocating dynamic arrays on stack \r\n#### Tools\r\n* Added support for device memory allocation in UCX perftest\r\n* Added a script to use for squashing commits after PR approval\r\n* Added support for DPU cross-gvmi daemon in UCX perftest\r\n#### Java\r\n* Added support for EP local socket address API in JUCX\r\n#### Build\r\n* Added address sanitizer support\r\n* Added a helper shell script to run static checks\r\n#### AZP\r\n* Replaced Valgrind tests with address sanitizer tool\r\n* Added Ubuntu 22.04 docker image testing\r\n#### Configuration\r\n* Added support for filtering configuration sections by platform type\r\n* Added configuration file with section for Grace Hopper\r\n### Bugfixes:\r\n#### UCP\r\n* Fixed crash due to incorrect lane selection when active message is disabled\r\n* Fixed RMA lane selection issue due to wrong bandwidth calculation \r\n* Fixed rendezvous protocol information in protocol details table\r\n* Fixed endpoint reconfiguration issue due to wrong bandwidth calculation\r\n* Fixed Active Message handlers issue due to out of order registration\r\n* Fixed registration of memh events for imported memory key\r\n* Fixed sockaddr unreachable destination error handling\r\n* Fixed uninitialized memory issue in new protocols infrastructure\r\n* Fixed race condition when using strong fence by flushing all endpoints\r\n* Fixed incorrect RMA message size 
on immediate completion with no datatype\r\n* Fixed incorrect performance estimation due to fp8 pack\u002Funpack issue\r\n* Fixed remote access error when rcache memory is not registered with atomic access\r\n* Fixed assertion failure when rcache fails during memh allocation\r\n* Fixed atomic device selection issue\r\n* Fixed worker interface deactivation while still in use by endpoints\r\n#### RDMA CORE (IB, ROCE, etc.)\r\n* Disabled device memory if atomics are not available\r\n* Fixed indirect keys creation for MT registered memory\r\n* Fixed KSM start address value when creating export key \r\n* Fixed DCI pool index to support maximum of 16 pools\r\n* Fixed atomic rkey issue when using imported memory\r\n* Fixed crash due to unsupported SRQ capability\r\n#### GPU (CUDA, ROCM)\r\n* Removed unused environment variable RCACHE_ADDR_ALIGN from ROCm transport\r\n* Fixed usage of cuda device 0 when no context is active\r\n* Removed error handling support from CUDA IPC transport\r\n* Fixed allocation of unaligned CUDA memory\r\n#### Shared Memory\r\n* Fixed occasional crash when shm_unlink fails during interface initialization\r\n#### UCS\r\n* Fixed system device distance calculation for devices on different PCIe root\r\n* Fixed support for large size arrays in ucs_array\r\n* Fixed synchronization issue in rcache\r\n#### Tests\r\n* Fixed test failures when GPU is present but disabled\r\n* Fixed Active Message hanging issue in ucp_client_server\r\n* Fixed potential crash due to redundant munmap call in ucp mmap tests\r\n* Fixed a crash when running CUDA gtest under valgrind\r\n* Fixed UD endpoint timeout issue under Valgrind\r\n#### Java\r\n* Fixed failures in Java tests by waiting for send requests completion\r\n* Fixed JVM segfault in Java tests when gdrcopy driver is not loaded\r\n* Fixed go build and go tests failures\r\n#### Packaging\r\n* Disabled Go bindings in Debian package","2024-06-03T08:10:17"]