[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-deepseek-ai--DeepEP":3,"tool-deepseek-ai--DeepEP":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",160015,2,"2026-04-18T11:30:52",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",109154,"2026-04-18T11:18:24",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 
# DeepEP

DeepEP: an efficient expert-parallel communication library.

## Overview

DeepEP is an efficient communication library designed for Mixture-of-Experts (MoE) models, built to remove the communication bottleneck that expert parallelism (EP) introduces in large-scale distributed training. By providing high-throughput, low-latency GPU kernels, it optimizes the dispatch and combine of data across expert ranks, significantly improving both training and inference efficiency.

The library is aimed at engineers and researchers working on large models, particularly teams deploying or optimizing models such as DeepSeek-V3 that use the group-limited gating algorithm. Its key strength is scenario-specific optimization: for training and inference prefilling it supports asymmetric-bandwidth forwarding across the NVLink and RDMA domains to make the most of the hardware, while for latency-sensitive inference decoding it provides pure-RDMA low-latency kernels together with a hook-based communication-computation overlap technique that hides communication cost without occupying any compute cores. DeepEP also natively supports low-precision operations such as FP8, and recent updates incorporate optimizations from the Tencent Network Platform Department for a substantial performance boost. From a single multi-GPU node to a multi-node cluster, DeepEP provides a solid communication foundation for high-performance MoE models.

## Introduction

DeepEP is a communication library tailored for Mixture-of-Experts (MoE) and expert parallelism (EP). It provides high-throughput and low-latency all-to-all GPU kernels, also known as MoE dispatch and combine. The library also supports low-precision operations, including FP8.

To align with the group-limited gating algorithm proposed in the [DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3) paper, DeepEP offers a set of kernels optimized for asymmetric-domain bandwidth forwarding, such as forwarding data from the NVLink domain to the RDMA domain. These kernels deliver high throughput, making them suitable for both training and inference prefilling tasks. Additionally, they support SM (Streaming Multiprocessors) number control.

For latency-sensitive inference decoding, DeepEP includes a set of low-latency kernels with pure RDMA to minimize delays. The library also introduces a hook-based communication-computation overlapping method that does not occupy any SM resources.

Notice: the implementation in this library may have some slight differences from the [DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3) paper.
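To make the dispatch/combine terminology concrete, here is a minimal single-process sketch of the data movement these kernels implement. It is purely illustrative (plain PyTorch, toy sizes, random routing, no DeepEP and no multi-GPU communication):

```python
# Toy, single-process illustration of MoE "dispatch" and "combine" semantics.
import torch

num_tokens, hidden, num_experts, top_k = 8, 16, 4, 2
x = torch.randn(num_tokens, hidden)
topk_idx = torch.randint(0, num_experts, (num_tokens, top_k))
topk_weights = torch.softmax(torch.randn(num_tokens, top_k), dim=-1)

# "Dispatch": each token is replicated once per routed expert and the copies are
# regrouped so that every expert sees a contiguous slice of its own tokens.
flat_expert = topk_idx.reshape(-1)
flat_token = torch.arange(num_tokens).repeat_interleave(top_k)
order = torch.argsort(flat_expert)
dispatched_x = x[flat_token[order]]
tokens_per_expert = torch.bincount(flat_expert, minlength=num_experts)

# Experts process their slices (identity "experts" here).
expert_out = dispatched_x.clone()

# "Combine": expert outputs are routed back to token order and summed with the gate weights.
combined = torch.zeros_like(x)
weights = topk_weights.reshape(-1)[order]
combined.index_add_(0, flat_token[order], expert_out * weights.unsqueeze(1))
print(tokens_per_expert.tolist(), combined.shape)
```

In an actual EP setup these expert-contiguous slices are exchanged between GPUs with all-to-all communication, which is exactly what DeepEP's dispatch and combine kernels accelerate.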
## Performance

### Normal kernels with NVLink and RDMA forwarding

We test normal kernels on H800 (~160 GB/s NVLink maximum bandwidth), with each connected to a CX7 InfiniBand 400 Gb/s RDMA network card (~50 GB/s maximum bandwidth), following the DeepSeek-V3/R1 pretraining setting (4096 tokens per batch, 7168 hidden, top-4 groups, top-8 experts, FP8 dispatching and BF16 combining).

|   Type    | Dispatch #EP | Bottleneck bandwidth | Combine #EP | Bottleneck bandwidth |
|:---------:|:------------:|:--------------------:|:-----------:|:--------------------:|
| Intranode |      8       |  153 GB/s (NVLink)   |      8      |  158 GB/s (NVLink)   |
| Internode |      16      |    43 GB/s (RDMA)    |     16      |    43 GB/s (RDMA)    |
| Internode |      32      |    58 GB/s (RDMA)    |     32      |    57 GB/s (RDMA)    |
| Internode |      64      |    51 GB/s (RDMA)    |     64      |    50 GB/s (RDMA)    |

**News (2025.04.22)**: with optimizations from the Tencent Network Platform Department, performance was enhanced by up to 30%, see [#130](https://github.com/deepseek-ai/DeepEP/pull/130) for more details. Thanks for the contribution!

### Low-latency kernels with pure RDMA

We test low-latency kernels on H800, with each connected to a CX7 InfiniBand 400 Gb/s RDMA network card (~50 GB/s maximum bandwidth), following a typical DeepSeek-V3/R1 production setting (128 tokens per batch, 7168 hidden, top-8 experts, FP8 dispatching and BF16 combining).

| Dispatch #EP | Latency | RDMA bandwidth | Combine #EP | Latency | RDMA bandwidth |
|:------------:|:-------:|:--------------:|:-----------:|:-------:|:--------------:|
|      8       |  77 us  |    98 GB/s     |      8      | 114 us  |    127 GB/s    |
|      16      | 118 us  |    63 GB/s     |     16      | 195 us  |    74 GB/s     |
|      32      | 155 us  |    48 GB/s     |     32      | 273 us  |    53 GB/s     |
|      64      | 173 us  |    43 GB/s     |     64      | 314 us  |    46 GB/s     |
|     128      | 192 us  |    39 GB/s     |     128     | 369 us  |    39 GB/s     |
|     256      | 194 us  |    39 GB/s     |     256     | 360 us  |    40 GB/s     |

**News (2025.06.05)**: low-latency kernels now leverage NVLink as much as possible, see [#173](https://github.com/deepseek-ai/DeepEP/pull/173) for more details. Thanks for the contribution!
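For intuition about the message sizes behind these numbers, a rough back-of-envelope calculation for the pretraining setting above is sketched below. It is only an upper-bound illustration of payload volume (group-limited gating and NVLink forwarding reduce how much of this actually crosses the RDMA fabric), not DeepEP's exact traffic model:

```python
# Back-of-envelope payload sizes, assuming every one of a token's top-8 expert
# copies leaves the local rank (which overstates real cross-node traffic).
tokens, hidden, topk = 4096, 7168, 8

dispatch_bytes_per_copy = hidden * 1   # FP8 dispatch: 1 byte per element
combine_bytes_per_copy = hidden * 2    # BF16 combine: 2 bytes per element

per_token_dispatch = topk * dispatch_bytes_per_copy
per_token_combine = topk * combine_bytes_per_copy
print(f"dispatch: ~{per_token_dispatch / 1024:.0f} KiB/token, "
      f"~{tokens * per_token_dispatch / 1e9:.2f} GB/batch upper bound")
print(f"combine:  ~{per_token_combine / 1024:.0f} KiB/token, "
      f"~{tokens * per_token_combine / 1e9:.2f} GB/batch upper bound")
```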
## Quick start

### Requirements

- Ampere (SM80), Hopper (SM90) GPUs, or other architectures with SM90 PTX ISA support
- Python 3.8 and above
- CUDA version
  - CUDA 11.0 and above for SM80 GPUs
  - CUDA 12.3 and above for SM90 GPUs
- PyTorch 2.1 and above
- NVLink for intranode communication
- RDMA network for internode communication

### Download and install NVSHMEM dependency

DeepEP also depends on NVSHMEM. Please refer to our [NVSHMEM Installation Guide](third-party/README.md) for instructions.

### Development

```bash
# Build and make symbolic links for SO files
NVSHMEM_DIR=/path/to/installed/nvshmem python setup.py build
# You may modify the specific SO names according to your own platform
ln -s build/lib.linux-x86_64-cpython-38/deep_ep_cpp.cpython-38-x86_64-linux-gnu.so

# Run test cases
# NOTES: you may modify the `init_dist` function in `tests/utils.py`
# according to your own cluster settings, and launch into multiple nodes
python tests/test_intranode.py
python tests/test_internode.py
python tests/test_low_latency.py
```

### Installation

```bash
NVSHMEM_DIR=/path/to/installed/nvshmem python setup.py install
```

#### Installation environment variables

- `NVSHMEM_DIR`: the path to the NVSHMEM directory; all internode and low-latency features are disabled if it is not specified
- `DISABLE_SM90_FEATURES`: 0 or 1, whether to disable SM90 features; required for devices without SM90 support or when building with CUDA 11
- `TORCH_CUDA_ARCH_LIST`: the list of target architectures, e.g. `TORCH_CUDA_ARCH_LIST="9.0"`
- `DISABLE_AGGRESSIVE_PTX_INSTRS`: 0 or 1, whether to disable aggressive load/store instructions, see [Undefined-behavior PTX usage](#undefined-behavior-ptx-usage) for more details

Then, import `deep_ep` in your Python project, and enjoy!
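As an optional sanity check before building, the snippet below verifies the PyTorch/CUDA side of the requirements; it does not check NVLink, RDMA, or NVSHMEM, which have to be validated separately:

```python
# Optional pre-build sanity check (PyTorch/CUDA side only).
import torch

print("PyTorch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)
assert torch.cuda.is_available(), "a CUDA device is required"

major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: sm_{major}{minor}")
# Ampere (SM80) or Hopper (SM90); other SM90-PTX-capable architectures also work.
assert (major, minor) >= (8, 0), "Ampere (SM80) or newer is required"
```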
## Network configurations

DeepEP is fully tested with InfiniBand networks. However, it is theoretically compatible with RDMA over Converged Ethernet (RoCE) as well.

### Traffic isolation

Traffic isolation is supported by InfiniBand through Virtual Lanes (VL).

To prevent interference between different types of traffic, we recommend segregating workloads across different virtual lanes as follows:

- workloads using normal kernels
- workloads using low-latency kernels
- other workloads

For DeepEP, you can control the virtual lane assignment by setting the `NVSHMEM_IB_SL` environment variable.

### Adaptive routing

Adaptive routing is an advanced routing feature provided by InfiniBand switches that can evenly distribute traffic across multiple paths. Enabling adaptive routing can completely eliminate network congestion caused by routing conflicts, but it also introduces additional latency. We recommend the following configuration for optimal performance:

- enable adaptive routing in environments with heavy network loads
- use static routing in environments with light network loads

### Congestion control

Congestion control is disabled as we have not observed significant congestion in our production environment.
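A minimal sketch of wiring this up from Python follows. The service-level values are placeholders, assuming your fabric administrator has reserved those virtual lanes; the variable has to be set before NVSHMEM is initialized, i.e. before the first `Buffer` is created:

```python
# Hypothetical traffic-isolation setup; the SL values are placeholders for your fabric.
import os

os.environ["NVSHMEM_IB_SL"] = "1"   # e.g. "1" for normal-kernel jobs, "2" for low-latency jobs
# ... then initialize torch.distributed and create the deep_ep.Buffer as usual ...
```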
## Interfaces and examples

### Example use in model training or inference prefilling

The normal kernels can be used in model training or the inference prefilling phase (without the backward part), as the example code below shows.

```python
import torch
import torch.distributed as dist
from typing import List, Tuple, Optional, Union

from deep_ep import Buffer, EventOverlap

# Communication buffer (will allocate at runtime)
_buffer: Optional[Buffer] = None

# Set the number of SMs to use
# NOTES: this is a static variable
Buffer.set_num_sms(24)


# You may call this function at the framework initialization
def get_buffer(group: dist.ProcessGroup, hidden_bytes: int) -> Buffer:
    global _buffer

    # NOTES: you may also replace `get_*_config` with your auto-tuned results via all the tests
    num_nvl_bytes, num_rdma_bytes = 0, 0
    for config in (Buffer.get_dispatch_config(group.size()), Buffer.get_combine_config(group.size())):
        num_nvl_bytes = max(config.get_nvl_buffer_size_hint(hidden_bytes, group.size()), num_nvl_bytes)
        num_rdma_bytes = max(config.get_rdma_buffer_size_hint(hidden_bytes, group.size()), num_rdma_bytes)

    # Allocate a buffer if none exists yet or the existing one is too small
    if _buffer is None or _buffer.group != group or _buffer.num_nvl_bytes < num_nvl_bytes or _buffer.num_rdma_bytes < num_rdma_bytes:
        _buffer = Buffer(group, num_nvl_bytes, num_rdma_bytes)
    return _buffer


def get_hidden_bytes(x: torch.Tensor) -> int:
    t = x[0] if isinstance(x, tuple) else x
    return t.size(1) * max(t.element_size(), 2)


def dispatch_forward(x: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]],
                     topk_idx: torch.Tensor, topk_weights: torch.Tensor,
                     num_experts: int, previous_event: Optional[EventOverlap] = None) -> \
        Tuple[Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]], torch.Tensor, torch.Tensor, List, Tuple, EventOverlap]:
    # NOTES: an optional `previous_event` is a captured CUDA event that you want the dispatch kernel
    # to depend on; it may be useful for communication-computation overlap. For more information, please
    # refer to the docs of `Buffer.dispatch`
    global _buffer

    # Calculate layout before actual dispatch
    num_tokens_per_rank, num_tokens_per_rdma_rank, num_tokens_per_expert, is_token_in_rank, previous_event = \
        _buffer.get_dispatch_layout(topk_idx, num_experts,
                                    previous_event=previous_event, async_finish=True,
                                    allocate_on_comm_stream=previous_event is not None)
    # Do MoE dispatch
    # NOTES: the CPU will wait for the GPU's signal to arrive, so this is not compatible with CUDA graph,
    # unless you specify `num_worst_tokens`, but that flag is for intranode only
    # For more advanced usages, please refer to the docs of the `dispatch` function
    recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, handle, event = \
        _buffer.dispatch(x, topk_idx=topk_idx, topk_weights=topk_weights,
                         num_tokens_per_rank=num_tokens_per_rank, num_tokens_per_rdma_rank=num_tokens_per_rdma_rank,
                         is_token_in_rank=is_token_in_rank, num_tokens_per_expert=num_tokens_per_expert,
                         previous_event=previous_event, async_finish=True,
                         allocate_on_comm_stream=True)
    # For event management, please refer to the docs of the `EventOverlap` class
    return recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, handle, event


def dispatch_backward(grad_recv_x: torch.Tensor, grad_recv_topk_weights: torch.Tensor, handle: Tuple) -> \
        Tuple[torch.Tensor, torch.Tensor, EventOverlap]:
    global _buffer

    # The backward process of MoE dispatch is actually a combine
    # For more advanced usages, please refer to the docs of the `combine` function
    combined_grad_x, combined_grad_recv_topk_weights, event = \
        _buffer.combine(grad_recv_x, handle, topk_weights=grad_recv_topk_weights, async_finish=True)

    # For event management, please refer to the docs of the `EventOverlap` class
    return combined_grad_x, combined_grad_recv_topk_weights, event


def combine_forward(x: torch.Tensor, handle: Tuple, previous_event: Optional[EventOverlap] = None) -> \
        Tuple[torch.Tensor, EventOverlap]:
    global _buffer

    # Do MoE combine
    # For more advanced usages, please refer to the docs of the `combine` function
    combined_x, _, event = _buffer.combine(x, handle, async_finish=True, previous_event=previous_event,
                                           allocate_on_comm_stream=previous_event is not None)

    # For event management, please refer to the docs of the `EventOverlap` class
    return combined_x, event


def combine_backward(grad_combined_x: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]],
                     handle: Tuple, previous_event: Optional[EventOverlap] = None) -> \
        Tuple[Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]], EventOverlap]:
    global _buffer

    # The backward process of MoE combine is actually a dispatch
    # For more advanced usages, please refer to the docs of the `dispatch` function
    grad_x, _, _, _, _, event = _buffer.dispatch(grad_combined_x, handle=handle, async_finish=True,
                                                 previous_event=previous_event,
                                                 allocate_on_comm_stream=previous_event is not None)

    # For event management, please refer to the docs of the `EventOverlap` class
    return grad_x, event
```
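For orientation, here is a hedged sketch of how these helpers compose into one MoE layer's forward pass; `expert_forward` is a placeholder for your own per-expert computation (including any gate weighting), and synchronization on the returned events is omitted for brevity:

```python
# Illustrative composition of the helpers above (not part of the DeepEP API).
def moe_layer_forward(x, topk_idx, topk_weights, num_experts, group):
    get_buffer(group, get_hidden_bytes(x))                        # ensure the communication buffer exists
    recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, handle, event = \
        dispatch_forward(x, topk_idx, topk_weights, num_experts)  # tokens now grouped by local expert
    expert_out = expert_forward(recv_x, recv_topk_idx, recv_topk_weights,
                                num_recv_tokens_per_expert_list)  # placeholder for your experts
    combined_x, _ = combine_forward(expert_out, handle)           # results merged back to source ranks
    return combined_x
```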
Moreover, inside the dispatch function, we may not know how many tokens the current rank will receive, so an implicit CPU wait for the GPU's received-count signal is involved, as the following figure shows.

![normal](https://oss.gittoolsai.com/images/deepseek-ai_DeepEP_readme_36d9f11de5de.png)

### Example use in inference decoding

The low-latency kernels can be used in the inference decoding phase, as the example code below shows.

```python
import torch
import torch.distributed as dist
from typing import Tuple, Optional

from deep_ep import Buffer

# Communication buffer (will allocate at runtime)
# NOTES: there is no SM control API for the low-latency kernels
_buffer: Optional[Buffer] = None


# You may call this function at the framework initialization
def get_buffer(group: dist.ProcessGroup, num_max_dispatch_tokens_per_rank: int, hidden: int, num_experts: int) -> Buffer:
    # NOTES: the low-latency mode will consume much more space than the normal mode
    # So we recommend that `num_max_dispatch_tokens_per_rank` (the actual batch size in the decoding engine) should be less than 256
    global _buffer
    num_rdma_bytes = Buffer.get_low_latency_rdma_size_hint(num_max_dispatch_tokens_per_rank, hidden, group.size(), num_experts)

    # Allocate a buffer if none exists yet or the existing one is too small
    if _buffer is None or _buffer.group != group or not _buffer.low_latency_mode or _buffer.num_rdma_bytes < num_rdma_bytes:
        # NOTES: for the best performance, the QP number **must** be equal to the number of local experts
        assert num_experts % group.size() == 0
        _buffer = Buffer(group, 0, num_rdma_bytes, low_latency_mode=True, num_qps_per_rank=num_experts // group.size())
    return _buffer


def low_latency_dispatch(hidden_states: torch.Tensor, topk_idx: torch.Tensor, num_max_dispatch_tokens_per_rank: int, num_experts: int):
    global _buffer

    # Do MoE dispatch, compatible with CUDA graph (but you may need to restore some buffer status once you replay)
    recv_hidden_states, recv_expert_count, handle, event, hook = \
        _buffer.low_latency_dispatch(hidden_states, topk_idx, num_max_dispatch_tokens_per_rank, num_experts,
                                     async_finish=False, return_recv_hook=True)

    # NOTES: the actual tensor will not be received until you call `hook()`,
    # which is useful for double-batch overlapping, but **without any SM occupation**
    # If you don't want to overlap, please set `return_recv_hook=False`
    # Later, you can use our GEMM library to do the computation with this specific format
    return recv_hidden_states, recv_expert_count, handle, event, hook


def low_latency_combine(hidden_states: torch.Tensor,
                        topk_idx: torch.Tensor, topk_weights: torch.Tensor, handle: Tuple):
    global _buffer

    # Do MoE combine, compatible with CUDA graph (but you may need to restore some buffer status once you replay)
    combined_hidden_states, event_overlap, hook = \
        _buffer.low_latency_combine(hidden_states, topk_idx, topk_weights, handle,
                                    async_finish=False, return_recv_hook=True)

    # NOTES: the same behavior as described in the dispatch kernel
    return combined_hidden_states, event_overlap, hook
```
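The following is a hedged sketch of the double-batch pattern the receive hook enables; only `low_latency_dispatch` and `low_latency_combine` come from the example above, while `run_other_compute` and `run_experts` are placeholders for your own kernels:

```python
# Placeholder sketch (not DeepEP API): overlap micro-batch B's communication
# with micro-batch A's compute by deferring the receive hook.
def decode_step(batch_a_state, hidden_states_b, topk_idx_b, topk_weights_b,
                num_max_dispatch_tokens_per_rank, num_experts):
    # Kick off micro-batch B's dispatch; no tokens are materialized yet and no SM is used.
    recv_b, recv_count_b, handle_b, _, dispatch_hook = low_latency_dispatch(
        hidden_states_b, topk_idx_b, num_max_dispatch_tokens_per_rank, num_experts)

    out_a = run_other_compute(batch_a_state)   # overlapped compute for micro-batch A

    dispatch_hook()                            # now actually receive B's dispatched tokens
    expert_out_b = run_experts(recv_b, recv_count_b)
    combined_b, _, combine_hook = low_latency_combine(
        expert_out_b, topk_idx_b, topk_weights_b, handle_b)
    # `combined_b` becomes valid after `combine_hook()` is called, which the caller
    # can again delay behind other compute.
    return out_a, combined_b, combine_hook
```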
For two-micro-batch overlapping, you can refer to the following figure. With our receiving hook interface, the RDMA network traffic happens in the background, without costing any GPU SMs from the computation part. But note that the overlapped parts can be adjusted, i.e., the four parts of attention/dispatch/MoE/combine may not have exactly the same execution time. You may adjust the stage settings according to your workload.

![low-latency](https://oss.gittoolsai.com/images/deepseek-ai_DeepEP_readme_6c25ee767f6d.png)

## Roadmap

- [x] AR support
- [x] Refactor low-latency mode AR code
- [x] A100 support (intranode only)
- [x] Support BF16 for the low-latency dispatch kernel
- [x] Support NVLink protocol for intranode low-latency kernels
- [ ] TMA copy instead of LD/ST
  - [x] Intranode kernels
  - [ ] Internode kernels
  - [ ] Low-latency kernels
- [ ] SM-free kernels and refactors
- [ ] Fully remove undefined-behavior PTX instructions

## Notices

#### Easier potential overall design

The current DeepEP implementation uses queues for communication buffers, which save memory but introduce complexity and potential deadlocks. If you're implementing your own version based on DeepEP, consider using fixed-size buffers allocated to maximum capacity for simplicity and better performance. For a detailed discussion of this alternative approach, see https://github.com/deepseek-ai/DeepEP/issues/39.
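As a purely illustrative sketch of that alternative (not DeepEP code, sizes assumed), a fixed-size scheme reserves a worst-case slot per peer up front, so senders write at precomputed offsets and no queue bookkeeping (or its deadlock hazards) is needed, at the cost of allocating for the maximum:

```python
# Toy illustration of fixed-size, maximum-capacity receive slots.
import torch

num_ranks, max_tokens_per_rank, hidden = 8, 256, 7168

# One contiguous region with a fixed slot per source rank; a per-slot counter
# records how many rows a sender actually filled in.
recv_buffer = torch.empty(num_ranks, max_tokens_per_rank, hidden, dtype=torch.bfloat16)
recv_counts = torch.zeros(num_ranks, dtype=torch.int32)

def write_from(src_rank: int, tokens: torch.Tensor) -> None:
    # A sender (or an RDMA write on its behalf) lands at a fixed, precomputed offset.
    n = tokens.size(0)
    recv_buffer[src_rank, :n] = tokens
    recv_counts[src_rank] = n
```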
#### Undefined-behavior PTX usage

- For extreme performance, we discover and use an undefined-behavior PTX usage: using the read-only PTX `ld.global.nc.L1::no_allocate.L2::256B` to **read volatile data**. The PTX modifier `.nc` indicates that a non-coherent cache is used. But the correctness is tested to be guaranteed with `.L1::no_allocate` on Hopper architectures, and performance will be much better. We guess the reason may be: the non-coherent cache is unified with L1, and the L1 modifier is not just a hint but a strong option, so that correctness can be guaranteed by having no dirty data in L1.
- Initially, because NVCC could not automatically unroll volatile read PTX, we tried using `__ldg` (i.e., `ld.nc`). Even compared to manually unrolled volatile reads, it was significantly faster (likely due to additional compiler optimizations). However, the results could be incorrect or dirty. After consulting the PTX documentation, we discovered that L1 and the non-coherent cache are unified on Hopper architectures. We speculated that `.L1::no_allocate` might resolve the issue, leading to this discovery.
- If you find kernels not working on some other platforms, you may add `DISABLE_AGGRESSIVE_PTX_INSTRS=1` to `setup.py` to disable this, or file an issue.

#### Auto-tuning on your cluster

For better performance on your cluster, we recommend running all the tests and using the best auto-tuned configuration. The default configurations are optimized on DeepSeek's internal cluster.

## License

This code repository is released under [the MIT License](LICENSE), except for codes that reference NVSHMEM (including `csrc/kernels/ibgda_device.cuh` and `third-party/nvshmem.patch`), which are subject to the [NVSHMEM SLA](https://docs.nvidia.com/nvshmem/api/sla.html).

## Experimental Branches

- [Zero-copy](https://github.com/deepseek-ai/DeepEP/pull/453)
  - Removes the copy between PyTorch tensors and communication buffers, which reduces SM usage significantly for normal kernels
  - This PR is authored by the **Tencent Network Platform Department**
- [Eager](https://github.com/deepseek-ai/DeepEP/pull/437)
  - Uses a low-latency protocol to remove the extra RTT latency introduced by RDMA atomic OPs
- [Hybrid-EP](https://github.com/deepseek-ai/DeepEP/tree/hybrid-ep)
  - A new backend implementation using TMA instructions for minimal SM usage and larger NVLink domain support
  - Fine-grained communication-computation overlap for single-batch scenarios
  - PCIe kernel support for non-NVLink environments
  - NVFP4 data type support
- [AntGroup-Opt](https://github.com/deepseek-ai/DeepEP/tree/antgroup-opt)
  - This optimization series is authored by the **AntGroup Network Platform Department**
  - [Normal-SMFree](https://github.com/deepseek-ai/DeepEP/pull/347): eliminates SMs from the RDMA path by decoupling comm-kernel execution from NIC token transfer, freeing SMs for compute
  - [LL-SBO](https://github.com/deepseek-ai/DeepEP/pull/483): overlaps Down GEMM computation with Combine Send communication via a signaling mechanism to reduce end-to-end latency
  - [LL-Layered](https://github.com/deepseek-ai/DeepEP/pull/500): optimizes cross-node LL operator communication using rail-optimized forwarding and data merging to reduce latency
- [Mori-EP](https://github.com/deepseek-ai/DeepEP/tree/mori-ep)
  - ROCm/AMD GPU support powered by the [MORI](https://github.com/ROCm/mori) backend (low-latency mode)

## Community Forks

- [uccl/uccl-ep](https://github.com/uccl-project/uccl/tree/main/ep) - Enables running DeepEP on heterogeneous GPUs (e.g., Nvidia, AMD) and NICs (e.g., EFA, Broadcom, CX7)
- [Infrawaves/DeepEP_ibrc_dual-ports_multiQP](https://github.com/Infrawaves/DeepEP_ibrc_dual-ports_multiQP) - Adds a multi-QP solution and dual-port NIC support in IBRC transport
- [antgroup/DeepXTrace](https://github.com/antgroup/DeepXTrace) - A diagnostic analyzer for efficient and precise localization of slow ranks
- [ROCm/mori](https://github.com/ROCm/mori) - AMD's next-generation communication library for performance-critical AI workloads (e.g., Wide EP, KVCache transfer, Collectives)

## Citation

If you use this codebase or otherwise find our work valuable, please cite:

```bibtex
@misc{deepep2025,
      title={DeepEP: an efficient expert-parallel communication library},
      author={Chenggang Zhao and Shangyan Zhou and Liyue Zhang and Chengqi Deng and Zhean Xu and Yuxuan Liu and Kuai Yu and Jiashi Li and Liang Zhao},
      year={2025},
      publisher = {GitHub},
      howpublished = {\url{https://github.com/deepseek-ai/DeepEP}},
}
```
## Example use case

A large AI lab is training a hundred-billion-parameter Mixture-of-Experts (MoE) model based on the DeepSeek-V3 architecture and is struggling with inefficient expert-parallel communication on its multi-node cluster.

### Without DeepEP

- **Severe cross-node bandwidth bottleneck**: when forwarding data from the NVLink domain to the RDMA domain, there is no optimization for the asymmetric bandwidth, so internode throughput stays far below the hardware limit and drags down overall training progress.
- **Inference latency targets missed**: during low-latency decoding, conventional communication kernels cannot fully exploit the pure-RDMA path, so a single all-to-all exchange takes too long for real-time interaction.
- **Compute resources consumed by communication**: communication occupies precious SM (streaming multiprocessor) capacity, so compute and communication cannot overlap effectively and GPU utilization shows obvious gaps while waiting for data.
- **Limited precision support**: without efficient native primitives for FP8 and other low-precision formats, ad-hoc adaptation is not only complex but also fails to exploit the strengths of the newest hardware.

### With DeepEP

- **Much higher cross-node throughput**: with the asymmetric-domain forwarding kernels tuned for DeepSeek-V3, RDMA bottleneck bandwidth stays at 50 GB/s or above under 64-way expert parallelism, squeezing the most out of the InfiniBand network.
- **Microsecond-level low-latency inference**: with the pure-RDMA low-latency kernels enabled, communication latency is kept within 200 microseconds even at 128-way expert parallelism, noticeably improving time to first token.
- **Overlapped execution with zero SM occupation**: the hook-based communication-computation overlap prefetches data without occupying any SM resources, keeping GPU compute fully loaded.
- **Native, efficient FP8 support**: high-performance FP8-dispatch and BF16-combine operators plug directly into modern training pipelines, saving memory while improving data-movement efficiency.

Through aggressive communication-kernel optimization, DeepEP breaks through the communication wall that constrains large-scale MoE models in multi-node training and real-time inference, turning cluster compute into model capability.
## FAQ

**Q: Does DeepEP still require GDRCopy (gdrdrv) for installation? How do I resolve a version mismatch?**

A: It is no longer a hard dependency. Since PR #201, DeepEP can be built without gdrcopy. If you hit a version mismatch while installing gdrdrv (for example, the README points to 2.4-4 while your system ships 2.5-1) or a DKMS build error, you can try skipping that step and building the project directly, or make sure the gdrdrv version matches your current kernel. ([source](https://github.com/deepseek-ai/DeepEP/issues/11))

**Q: With `return_recv_hook=False`, why is the dispatch-phase latency low while the combine-phase latency is high? Why does submitting send requests account for so little time?**

A: Because SEND operations are asynchronous and return without waiting for completion, submitting the send requests takes only a small share of the time. Dispatch latency mainly comes from receiving data, so it is low when little data is received; conversely, combine latency rises when a large amount of data must be received. With `return_recv_hook=False`, the kernel waits synchronously for communication to finish, so the measured execution time is longer and is bounded by the slower of the two operations (send or receive). ([source](https://github.com/deepseek-ai/DeepEP/issues/208))

**Q: How exactly does setting `return_recv_hook` to True or False affect kernel execution time and the communication flow?**

A: With `return_recv_hook=False`, the kernel returns only after communication has fully completed, so the measured kernel time is longer. With `return_recv_hook=True`, the kernel exits immediately after submitting the send requests, and the received data is handled later (via the hook) once computation is done. In that mode, dispatch time mainly comes from submitting the send requests and combine time mainly from accumulating the received data. This mode is typically used in inference decoding to hide communication latency. ([source](https://github.com/deepseek-ai/DeepEP/issues/208))

**Q: `test_internode.py` hangs or times out on dual-port RDMA NICs. What should I do?**

A: This is usually caused by how RDMA traffic is distributed across a dual-port NIC, which breaks the ordering guarantee between atomic operations (AMOs) and data writes. The solution is to partition the channels: route one half of the channels' RDMA operations through one QP (bound to one port) and the other half through another QP. This uses both ports while keeping the required ordering. Do not try to isolate the atomic operations onto a separate QP, as that breaks their ordering dependency on the preceding data operations. ([source](https://github.com/deepseek-ai/DeepEP/issues/74))

**Q: How can I use `test_low_latency` to profile latency with different batch sizes on different ranks?**

A: You can assign a different `num_tokens` value to each rank to simulate different loads. If `test_low_latency.py` hangs, check whether an inappropriate `return_recv_hook` setting is causing a synchronous-wait deadlock. It helps to understand how SEND and RECV operations are serialized: when testing the latency impact of different loads, make sure the communication pairs match up, and use `return_recv_hook=True` to handle receives asynchronously and avoid blocking on a single point. ([source](https://github.com/deepseek-ai/DeepEP/issues/208))

**Q: What is the root cause of the "DeepEP dispatch NVL receiver timeout" error?**

A: It usually occurs in multi-node, multi-GPU setups when the receiver does not detect valid start/end offsets within the timeout window (`NUM_TIMEOUT_CYCLES`). With dual-port NICs, this is often because RDMA writes and the atomic operations that advance the buffer pointers execute in inconsistent order on different ports, so the receiver's condition is never satisfied. The fix is to reconfigure the QP-to-port mapping so that the data and atomic operations of a given channel go through the same port/QP and ordering is maintained. ([source](https://github.com/deepseek-ai/DeepEP/issues/74))