[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-ByteDance-Seed--Triton-distributed":3,"tool-ByteDance-Seed--Triton-distributed":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":75,"owner_avatar_url":76,"owner_bio":77,"owner_company":78,"owner_location":78,"owner_email":79,"owner_twitter":78,"owner_website":80,"owner_url":81,"languages":82,"stars":113,"forks":114,"last_commit_at":115,"license":116,"difficulty_score":117,"env_os":118,"env_gpu":119,"env_ram":120,"env_deps":121,"category_tags":131,"github_topics":78,"view_count":23,"oss_zip_url":78,"oss_zip_packed_at":78,"status":16,"created_at":132,"updated_at":133,"faqs":134,"releases":163},2877,"ByteDance-Seed\u002FTriton-distributed","Triton-distributed","Distributed Compiler based on Triton for Parallel Systems","Triton-distributed 是由字节跳动 Seed 团队基于 OpenAI Triton 打造的分布式编译器，专为并行计算系统设计。它核心解决了大规模 AI 模型训练中“计算”与“通信”难以高效协同的痛点，通过独特的计算 - 通信重叠技术，让开发者能够编写出媲美高度优化底层库（如 Distributed-GEMM）的高效算子。\n\n这款工具主要面向需要深度定制高性能内核的 AI 基础设施工程师、系统研究人员及算法开发者。无论是使用 NVIDIA 还是 AMD GPU，用户都能利用它轻松构建复杂的分布式算子，无需在多种硬件架构间重复造轮子。其技术亮点在于支持 MegaTritonKernel 大核融合、内置细粒度的核内性能分析器（Intra-Kernel Profiler），并能显著加速 MoE（混合专家）模型及端到端推理任务。近期更新还引入了低延迟模式与令牌节省机制，进一步提升了在 H800、L20 等主流算力卡上的运行效率。如果你希望突破现有框架的性能瓶颈，亲手打造极致的分布式计算流程，Triton-distributed 将是一个强大且灵活的选择。","\u003Cdiv align=\"center\">\n 👋 Hi, everyone!\n    \u003Cbr>\n    We are \u003Cb>ByteDance Seed team.\u003C\u002Fb>\n\u003C\u002Fdiv>\n\n\u003Cp align=\"center\">\n  You can get to know us better through the following channels👇\n  \u003Cbr>\n  \u003Ca href=\"https:\u002F\u002Fteam.doubao.com\u002F\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWebsite-%231e37ff?style=for-the-badge&logo=bytedance&logoColor=white\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F93481cda-a7f3-47f3-b333-fe6b3da86b78\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWeChat-07C160?style=for-the-badge&logo=wechat&logoColor=white\">\u003C\u002Fa>\n \u003Ca href=\"https:\u002F\u002Fwww.xiaohongshu.com\u002Fuser\u002Fprofile\u002F668e7e15000000000303157d?xsec_token=ABl2-aqekpytY6A8TuxjrwnZskU-6BsMRE_ufQQaSAvjc%3D&xsec_source=pc_search\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FXiaohongshu-%23FF2442?style=for-the-badge&logo=xiaohongshu&logoColor=white\">\u003C\u002Fa>\n  \u003Ca 
href=\"https:\u002F\u002Fwww.zhihu.com\u002Forg\u002Fdou-bao-da-mo-xing-tuan-dui\u002F\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fzhihu-%230084FF?style=for-the-badge&logo=zhihu&logoColor=white\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n![seed logo](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FByteDance-Seed_Triton-distributed_readme_6a0d1d7b7f37.png)\n\n# Triton-distributed\n\n\u003C!-- \u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fbytedance\u002Fflux\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTriton-distributed-Project Page-yellow\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002Fxxxx.xxxx\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTriton-distributed-Tech Report-red\">\u003C\u002Fa>\n  \u003Cbr>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fd3fcb3bf-466b-4efe-8c3f-5f85258202ae\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTriton-distributed-Wechat Communication Group-07C160\">\u003C\u002Fa>\n  \u003Ca href=\"XXX\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-blue\">\u003C\u002Fa>\n\u003C\u002Fp> -->\n\n[Original Triton README](https:\u002F\u002Fgithub.com\u002Ftriton-lang\u002Ftriton\u002Fblob\u002Fmain\u002FREADME.md) | [README in Chinese](README-cn.md)\n\nTriton-distributed is a distributed compiler designed for computation-communication overlapping, which is based on OpenAI Triton.\n\nUsing Triton-distributed, programmers are able to develop efficient kernels comparable to highly-optimized libraries (including [Distributed-GEMM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002F65_distributed_gemm) and [FLUX](https:\u002F\u002Fgithub.com\u002Fbytedance\u002Fflux\u002Fblob\u002Fmain\u002FREADME.md)).\nTriton-distributed currently mainly targets Nvidia GPU and AMD GPU. 
It can also be ported to other hardware platforms.\nFeel free to contact us if you want to use Triton-distributed on your own hardware.\n\n## News\n- 12\u002F22\u002F2025 ✨✨✨: Updated EP functions, supporting low-latency mode, token saving, and Mega-EP.\n- 10\u002F21\u002F2025 🔥🔥🔥: Triton-distributed is presented at [Triton Conference 2025](https:\u002F\u002Ftritonconference.eventbuilder.com\u002FTritonDeveloperConference?ref=TritonDeveloperConference), see the [talk](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLc_vA1r0qoiQqCdWFDUDqI90oY5EjfGuO) for details.\n- 09\u002F03\u002F2025 ✨✨✨: Introduced the Intra-Kernel Profiler, see the [doc](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fblob\u002Fmain\u002Fdocs\u002Fgetting-started\u002Fprofiler\u002Fintra_kernel_profiler.md) for details.\n- 08\u002F24\u002F2025 ⚡⚡⚡: Support inference acceleration for [ByteDance-Seed\u002FSeed-OSS-36B-Instruct](https:\u002F\u002Fhuggingface.co\u002FByteDance-Seed\u002FSeed-OSS-36B-Instruct), achieving a 1.33x speedup.\n- 08\u002F13\u002F2025 ✨✨✨: Introduced the MegaTritonKernel and provided a Qwen3 TP demo on H20\u002FH800, see the [doc](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fblob\u002Fmain\u002Fdocs\u002Fgetting-started\u002Fmegakernel\u002Fmegakernel.md) for details.\n- 08\u002F06\u002F2025 ✨✨✨: Support GEMM+AllReduce on H800 and MoE operators on L20, see [GEMM+AR Test](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fblob\u002Fmain\u002Fpython\u002Ftriton_dist\u002Ftest\u002Fnvidia\u002Ftest_gemm_ar.py) and [MOE Test](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fblob\u002Fmain\u002Fpython\u002Ftriton_dist\u002Ftest\u002Fnvidia\u002Ftest_moe_reduce_rs.py) for details.\n- 07\u002F24\u002F2025 🤖🤖🤖: Introduced an end-to-end inference acceleration demo with unified support for both NVIDIA and AMD GPUs. See the [doc](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fblob\u002Fmain\u002Fdocs\u002Fgetting-started\u002Fe2e\u002Fe2e_dense.md) for details.\n- 07\u002F11\u002F2025 ✨✨✨: Fast AllReduce implemented with Triton-distributed, see [AllReduce Test](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fblob\u002Fmain\u002Fpython\u002Ftriton_dist\u002Ftest\u002Fnvidia\u002Ftest_allreduce.py).\n- 07\u002F11\u002F2025 ✨✨✨: Improved MoE operators for tensor parallelism. See [AG+MoE Test](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fblob\u002Fmain\u002Fpython\u002Ftriton_dist\u002Ftest\u002Fnvidia\u002Ftest_ag_moe.py) and [MoE+RS Test](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fblob\u002Fmain\u002Fpython\u002Ftriton_dist\u002Ftest\u002Fnvidia\u002Ftest_moe_reduce_rs.py).\n- 07\u002F11\u002F2025 ✨✨✨: Triton 3.4 support with NVSHMEM4py ([MR](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fpull\u002F54)). `pip install` is also supported without any need to modify NVSHMEM code.\n- 05\u002F12\u002F2025 🚀🚀🚀: Our paper `TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives` was accepted by MLSys 2025.\n\n## Getting started\n\n### Install Triton-distributed\n\n#### Method 1. From source\n\nSee [build from source](docs\u002Fbuild.md).\n\n#### Method 2. Using pip\n\nPrepare PyTorch container\n\n```sh\ndocker run --name triton-dist --ipc=host --network=host --privileged --cap-add=SYS_ADMIN --shm-size=10g --gpus=all -itd nvcr.io\u002Fnvidia\u002Fpytorch:25.04-py3 \u002Fbin\u002Fbash\ndocker exec -it triton-dist \u002Fbin\u002Fbash\n```\n\nInstall Dependencies\n\n```sh\npip3 install cuda.core==0.2.0 nvidia-nvshmem-cu12==3.3.9 Cython==0.29.24 nvshmem4py-cu12==0.1.2\npip3 install cuda-python==12.4 setuptools==69.0.0 wheel pybind11\n```\n\nThen, pip install triton-dist.\n```sh\n# Remove triton installed with torch\npip uninstall triton\npip uninstall triton_dist # remove previous triton-dist\nrm -rf \u002Fusr\u002Flocal\u002Flib\u002Fpython3.12\u002Fdist-packages\u002Ftriton\n# Install Triton-distributed\nVERSION=v0.0.2 # use the latest version\npip install https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Freleases\u002Fdownload\u002F${VERSION}\u002Ftriton_dist-3.4.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl\n```\n
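\nAfter installing, a quick import check confirms the wheel matches the container environment (a minimal sanity check, assuming the wheel exposes a top-level `triton_dist` module, as the examples below do):\n\n```py\n# Minimal post-install sanity check (run inside the container).\nimport torch\nimport triton_dist  # assumption: the wheel installs a top-level `triton_dist` package\n\nprint('triton_dist imported; CUDA available:', torch.cuda.is_available())\n```\n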
\n### How to use Triton-distributed\nTriton-distributed provides a set of easy-to-use primitives to support the development of distributed compute-communication overlapping kernels. The primitives are divided into low-level primitives and high-level primitives. Currently, we have released our low-level primitives, and we plan to release the high-level primitives in the future.\n\n[Triton-distributed Primitives](docs\u002Fprimitives.md)\n\nUsing these primitives, users can program communication kernels easily. For example, a low-latency AllToAll (with better latency than [DeepEP](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepEP) for inference) is shown below.\nThe performance of this example on 32 H800 GPUs is 137us (128 tokens per rank, topk=8, hidden_size=7168, dtype=fp8), while DeepEP takes 182us (note: DeepEP doesn't use NVLink for inference).\n```py\n@triton_dist.jit\ndef all_to_all_kernel(\n    data_src,\n    data_dst,\n    splits_src,\n    splits_dst,\n    signal,\n    splits_cumsum,\n    scale_src,\n    scale_dst,\n    rank: int,\n    call_count: int,\n    WITH_SCALE: tl.constexpr,\n    WORLD_SIZE: tl.constexpr,\n    HIDDEN: tl.constexpr,\n    MAX_M: tl.constexpr,\n    EXPERTS_PER_RANK: tl.constexpr,\n    NUM_TOT_EXPERTS: tl.constexpr,\n    ELEMENT_SIZE: tl.constexpr = 2,\n    SCALE_ELEMENT_SIZE: tl.constexpr = 4,\n):\n    pid = tl.program_id(0)\n    threadidx = tid(axis=0)\n\n    exp_st = pid * EXPERTS_PER_RANK\n    exp_ed = exp_st + EXPERTS_PER_RANK\n\n    m_st = tl.load(splits_cumsum + exp_st)\n    m_ed = tl.load(splits_cumsum + exp_ed)\n    num_rows_cur_block = m_ed - m_st\n\n    src_off = m_st\n    dst_off = rank * MAX_M\n\n    split_src_ptr = splits_src + exp_st\n    off0 = exp_st + tl.arange(0, EXPERTS_PER_RANK)\n    off1 = exp_st + tl.arange(0, EXPERTS_PER_RANK) + 1\n    cumsum_sts = tl.load(splits_cumsum + off0)\n    cumsum_eds = tl.load(splits_cumsum + off1)\n    tl.store(split_src_ptr + tl.arange(0, EXPERTS_PER_RANK), cumsum_eds - cumsum_sts)\n\n    act_pos = call_count % 2\n    data_dst_ptr = data_dst + act_pos * WORLD_SIZE * MAX_M * HIDDEN + dst_off * HIDDEN\n    split_dst_ptr = splits_dst + act_pos * NUM_TOT_EXPERTS + rank * EXPERTS_PER_RANK\n    signal_ptr = signal + act_pos * WORLD_SIZE + rank\n\n    libshmem_device.putmem_nbi_block(\n        data_dst_ptr,\n        data_src + src_off * HIDDEN,\n        num_rows_cur_block * HIDDEN * ELEMENT_SIZE,\n        pid,\n    )\n    libshmem_device.putmem_nbi_block(\n        split_dst_ptr,\n        split_src_ptr,\n        EXPERTS_PER_RANK * 4,  # now we use `int32` for splits\n        pid,\n    )\n    if WITH_SCALE:\n        scale_dst_ptr = scale_dst + act_pos * WORLD_SIZE * MAX_M + dst_off\n        libshmem_device.putmem_signal_nbi_block(\n            scale_dst_ptr,\n            scale_src + src_off,\n            num_rows_cur_block * SCALE_ELEMENT_SIZE,\n            signal_ptr,\n            call_count,\n            libshmem_device.NVSHMEM_SIGNAL_SET,\n            pid,\n        )\n\n    libshmem_device.fence()\n    if threadidx == 0:\n        if not WITH_SCALE:\n            libshmem_device.signal_op(\n                signal_ptr,\n                call_count,\n                libshmem_device.NVSHMEM_SIGNAL_SET,\n                pid,\n            )\n        libshmem_device.signal_wait_until(\n            signal + act_pos * WORLD_SIZE + pid,\n            libshmem_device.NVSHMEM_CMP_EQ,\n            call_count,\n        )\n```\n\nAlso, users can combine the communication part with the computation part to design overlapping kernels. We have provided example implementations in `python\u002Ftriton_dist\u002Fkernels`.\n
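\nFor orientation, the host side that launches such a kernel typically pins each process to a GPU and initializes a process group before the Triton-style `kernel[(grid,)]` launch. The sketch below is illustrative only: it uses plain PyTorch distributed APIs, and the commented-out allocation and launch lines stand in for the NVSHMEM symmetric-memory setup done by the real examples in `python\u002Ftriton_dist\u002Fkernels`:\n\n```py\n# Illustrative host-side scaffold (not the library's exact API).\nimport os\n\nimport torch\nimport torch.distributed as dist\n\n\ndef main():\n    rank = int(os.environ['RANK'])\n    world_size = int(os.environ['WORLD_SIZE'])\n    # Bind this process to one GPU before any communication happens.\n    torch.cuda.set_device(rank % torch.cuda.device_count())\n    dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)\n    # Allocate data_src\u002Fdata_dst\u002Fsignal as NVSHMEM symmetric buffers here, then launch:\n    # all_to_all_kernel[(world_size,)](data_src, data_dst, ..., rank, call_count, ...)\n    dist.destroy_process_group()\n\n\nif __name__ == '__main__':\n    main()\n```\n\nRun it with one process per GPU, e.g. `torchrun --nproc_per_node=8 your_script.py` (where `your_script.py` is a placeholder name).\n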
\n## Performance\nTriton-distributed can achieve comparable or better performance than hand-tuned libraries.\n\n\n### AllGather GEMM on a single node of H800x8\n![AG-GEMM-intra-node](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FByteDance-Seed_Triton-distributed_readme_43ba98affa7e.png)\n\n### GEMM ReduceScatter on a single node of H800x8\n![GEMM-RS-intra-node](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FByteDance-Seed_Triton-distributed_readme_453a03d17e69.png)\n\n### AllGather GEMM on 2 nodes of H800x8\n![AG-GEMM-inter-node](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FByteDance-Seed_Triton-distributed_readme_41710ed9f960.png)\n\n### GEMM ReduceScatter on 2 nodes of H800x8\n![GEMM-RS-inter-node](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FByteDance-Seed_Triton-distributed_readme_2ed55e9c71ad.png)\n\n### Scaling of Distributed Flash-Decode from 1 GPU to 32 GPUs\nThe batch size is 1 (one query) for decoding.\n![flash-decode-inter-node](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FByteDance-Seed_Triton-distributed_readme_581eb6f2b3e4.png)\n\n### Performance on Other Platforms\n[AMD GPUs](docs\u002Famd-perf.md)\n\n\n## Roadmaps\n\n### Functionalities\n- [x] Release low-level primitives\n- [ ] Release high-level primitives\n- [x] Tutorials\n- [x] Pre-built binary\n\n### Kernels\n- [x] Release single-node GEMM TP overlapping kernels\n- [x] Release single-node MoE TP overlapping kernels\n- [x] Release single-node distributed Flash-Decoding kernels\n- [ ] Release single-node MoE EP overlapping kernels\n- [x] Release cross-node GEMM TP overlapping kernels\n- [x] Release cross-node MoE TP overlapping kernels\n- [x] Release cross-node distributed Flash-Decoding kernels\n- [x] Release cross-node EP all-to-all kernels (similar to [DeepEP](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepEP))\n- [x] Provide tutorials for kernel implementation\n\n### Backends\nComputation\n- [x] Nvidia SM90a support\n- [x] Nvidia SM80 support\n- [x] Nvidia SM89 support\n- [x] AMD CDNA3 support\n\nCommunication\n- [x] NVLink\n- [x] IB\n- [x] PCIe\n\n### Performance\n- [x] Performance report\n\n## License\nThe Triton-distributed project is under the MIT license.\nPart of our code is under the Apache-2.0 License:\n- `python\u002Ftriton_dist\u002Fkernels\u002Fflash_decode.py`\n\n\n## Citation\nIf you use Triton-distributed in a scientific publication, we encourage you to add the following reference 
to the related papers:\n```bibtex\n@misc{zheng2025tritondistributed,\n      title={Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler},\n      author={Size Zheng and Wenlei Bao and Qi Hou and Xuegui Zheng and Jin Fang and Chenhui Huang and Tianqi Li and Haojie Duanmu and Renze Chen and Ruifan Xu and Yifan Guo and Ningxin Zheng and Ziheng Jiang and Xinyi Di and Dongyang Wang and Jianxi Ye and Haibin Lin and Li-Wen Chang and Liqiang Lu and Yun Liang and Jidong Zhai and Xin Liu},\n      year={2025},\n      eprint={2504.19442},\n      archivePrefix={arXiv},\n      primaryClass={cs.DC},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.19442},\n}\n\n@article{zheng2025tilelink,\n  title={Tilelink: Generating efficient compute-communication overlapping kernels using tile-centric primitives},\n  author={Zheng, Size and Fang, Jin and Zheng, Xuegui and Hou, Qi and Bao, Wenlei and Zheng, Ningxin and Jiang, Ziheng and Wang, Dongyang and Ye, Jianxi and Lin, Haibin and others},\n  journal={arXiv preprint arXiv:2503.20313},\n  year={2025}\n}\n```\n\n# About [ByteDance Seed Team](https:\u002F\u002Fteam.doubao.com\u002F)\n\nFounded in 2023, ByteDance Seed Team is dedicated to crafting the industry's most advanced AI foundation models. The team aspires to become a world-class research team and make significant contributions to the advancement of science and society.\n\n# Discussion and Contribution\nPlease use issues or pull requests for discussion and contribution (see [CONTRIBUTING.md](CONTRIBUTING.md)).\n","\u003Cdiv align=\"center\">\n 👋 大家好！\n    \u003Cbr>\n    我们是\u003Cb>字节跳动Seed团队。\u003C\u002Fb>\n\u003C\u002Fdiv>\n\n\u003Cp align=\"center\">\n  您可以通过以下渠道进一步了解我们👇\n  \u003Cbr>\n  \u003Ca href=\"https:\u002F\u002Fteam.doubao.com\u002F\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWebsite-%231e37ff?style=for-the-badge&logo=bytedance&logoColor=white\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F93481cda-a7f3-47f3-b333-fe6b3da86b78\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWeChat-07C160?style=for-the-badge&logo=wechat&logoColor=white\">\u003C\u002Fa>\n \u003Ca href=\"https:\u002F\u002Fwww.xiaohongshu.com\u002Fuser\u002Fprofile\u002F668e7e15000000000303157d?xsec_token=ABl2-aqekpytY6A8TuxjrwnZskU-6BsMRE_ufQQaSAvjc%3D&xsec_source=pc_search\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FXiaohongshu-%23FF2442?style=for-the-badge&logo=xiaohongshu&logoColor=white\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fwww.zhihu.com\u002Forg\u002Fdou-bao-da-mo-xing-tuan-dui\u002F\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fzhihu-%230084FF?style=for-the-badge&logo=zhihu&logoColor=white\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n![seed logo](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FByteDance-Seed_Triton-distributed_readme_6a0d1d7b7f37.png)\n\n# Triton-distributed\n\n\u003C!-- \u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fbytedance\u002Fflux\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTriton-distributed-Project Page-yellow\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002Fxxxx.xxxx\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTriton-distributed-Tech Report-red\">\u003C\u002Fa>\n  \u003Cbr>\n  \u003Ca 
href=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fd3fcb3bf-466b-4efe-8c3f-5f85258202ae\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTriton-distributed-Wechat Communication Group-07C160\">\u003C\u002Fa>\n  \u003Ca href=\"XXX\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-blue\">\u003C\u002Fa>\n\u003C\u002Fp> -->\n\n[原版 Triton README](https:\u002F\u002Fgithub.com\u002Ftriton-lang\u002Ftriton\u002Fblob\u002Fmain\u002FREADME.md) | [中文版 README](README-cn.md)\n\nTriton-distributed 是一款基于 OpenAI Triton 的分布式编译器，专为计算与通信重叠设计。\n\n借助 Triton-distributed，程序员可以开发出媲美高度优化库的高效内核（包括 [Distributed-GEMM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002F65_distributed_gemm) 和 [FLUX](https:\u002F\u002Fgithub.com\u002Fbytedance\u002Fflux\u002Fblob\u002Fmain\u002FREADME.md)）。目前，Triton-distributed 主要面向 Nvidia GPU 和 AMD GPU，未来也可移植到其他硬件平台。如果您希望在自己的硬件上使用 Triton-distributed，请随时联系我们。\n\n## 新闻\n- 2025年12月22日 ✨✨✨：更新了 EP 函数，支持低延迟模式、令牌节省和 Mega-EP。\n- 2025年10月21日 🔥🔥🔥：Triton-distributed 在 [2025 年 Triton 开发者大会](https:\u002F\u002Ftritonconference.eventbuilder.com\u002FTritonDeveloperConference?ref=TritonDeveloperConference) 上进行了展示，详情请参阅 [演讲视频](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLc_vA1r0qoiQqCdWFDUDqI90oY5EjfGuO)。\n- 2025年3月9日 ✨✨✨：推出了内核级性能分析器，详情请参阅 [文档](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fblob\u002Fmain\u002Fdocs\u002Fgetting-started\u002Fprofiler\u002Fintra_kernel_profiler.md)。\n- 2025年8月24日 ⚡⚡⚡：支持对 [ByteDance-Seed\u002FSeed-OSS-36B-Instruct](https:\u002F\u002Fhuggingface.co\u002FByteDance-Seed\u002FSeed-OSS-36B-Instruct) 进行推理加速，速度提升至 1.33 倍。\n- 2025年8月13日 ✨✨✨：引入了 MegaTritonKernel，并在 H20\u002FH800 上提供了 Qwen3 TP 示例，详情请参阅 [文档](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fblob\u002Fmain\u002Fdocs\u002Fgetting-started\u002Fmegakernel\u002Fmegakernel.md)。\n- 2025年8月6日 ✨✨✨：支持在 H800 上进行 GEMM+AllReduce 操作，并在 L20 上支持 MoE 算子，详细信息请参见 [GEMM+AR 测试](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fblob\u002Fmain\u002Fpython\u002Ftriton_dist\u002Ftest\u002Fnvidia\u002Ftest_gemm_ar.py) 和 [MOE 测试](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fblob\u002Fmain\u002Fpython\u002Ftriton_dist\u002Ftest\u002Fnvidia\u002Ftest_moe_reduce_rs.py)。\n- 2025年7月24日 🤖🤖🤖：推出了端到端推理加速演示，统一支持 NVIDIA 和 AMD GPU，详情请参阅 [文档](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fblob\u002Fmain\u002Fdocs\u002Fgetting-started\u002Fe2e\u002Fe2e_dense.md)。\n- 2025年7月11日 ✨✨✨：利用 Triton-distributed 实现了快速 AllReduce，详情请参见 [AllReduce 测试](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fblob\u002Fmain\u002Fpython\u002Ftriton_dist\u002Ftest\u002Fnvidia\u002Ftest_allreduce.py)。\n- 2025年7月11日 ✨✨✨：改进了张量并行下的 MoE 算子，详情请参见 [AG+MoE 测试](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fblob\u002Fmain\u002Fpython\u002Ftriton_dist\u002Ftest\u002Fnvidia\u002Ftest_ag_moe.py) 和 [MoE+RS 测试](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fblob\u002Fmain\u002Fpython\u002Ftriton_dist\u002Ftest\u002Fnvidia\u002Ftest_moe_reduce_rs.py)。\n- 2025年7月11日 ✨✨✨：支持 Triton 3.4，并集成 NVSHMEM4py（[MR](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fpull\u002F54)）。无需修改 NVSHMEM 代码即可通过 `pip install` 安装。\n- 2025年5月12日 🚀🚀🚀：我们的论文《TileLink：利用以块为中心的原语生成高效的计算-通信重叠内核》已被 MLSys 2025 接受。\n\n## 入门指南\n\n### 安装 
Triton-distributed\n\n#### 方法一：从源码编译\n\n请参阅 [从源码构建](docs\u002Fbuild.md)。\n\n#### 方法二：使用 pip 安装\n\n准备 PyTorch 容器\n\n```sh\ndocker run --name triton-dist --ipc=host --network=host --privileged --cap-add=SYS_ADMIN --shm-size=10g --gpus=all -itd nvcr.io\u002Fnvidia\u002Fpytorch:25.04-py3 \u002Fbin\u002Fbash\ndocker exec -it triton-dist \u002Fbin\u002Fbash\n```\n\n安装依赖项\n\n```sh\npip3 install cuda.core==0.2.0 nvidia-nvshmem-cu12==3.3.9 Cython==0.29.24 nvshmem4py-cu12==0.1.2\npip3 install cuda-python==12.4 setuptools==69.0.0 wheel pybind11\n```\n\n然后，使用 pip 安装 Triton-distributed。\n```sh\n# 移除随 torch 安装的 Triton\npip uninstall triton\npip uninstall triton_dist # 移除之前的 Triton-distributed\nrm -rf \u002Fusr\u002Flocal\u002Flib\u002Fpython3.12\u002Fdist-packages\u002Ftriton\n# 安装 Triton-distributed\nVERSION=v0.0.2 # 使用最新版本\npip install https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Freleases\u002Fdownload\u002F${VERSION}\u002Ftriton_dist-3.4.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl\n```\n\n### 如何使用 Triton-distributed\nTriton-distributed 提供了一组易于使用的原语，以支持分布式计算与通信重叠内核的开发。这些原语分为低层原语和高层原语。目前，我们已发布了低层原语，并计划在未来发布高层原语。\n\n[Triton-distributed 原语](docs\u002Fprimitives.md)\n\n借助这些原语，用户可以轻松地编写通信内核。例如，下面展示了一个低延迟的 AllToAll 操作（其延迟优于用于推理的 [DeepEP](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepEP)）。\n在 32 张 H800 GPU 上，该示例的性能为 137 微秒（每张卡 128 个标记，topk=8，隐藏层大小为 7168，数据类型为 fp8），而 DeepEP 的性能则为 182 微秒（注意：DeepEP 在推理时不使用 NVLink）。\n```py\n@triton_dist.jit\ndef all_to_all_kernel(\n    data_src,\n    data_dst,\n    splits_src,\n    splits_dst,\n    signal,\n    splits_cumsum,\n    scale_src,\n    scale_dst,\n    rank: int,\n    call_count: int,\n    WITH_SCALE: tl.constexpr,\n    WORLD_SIZE: tl.constexpr,\n    HIDDEN: tl.constexpr,\n    MAX_M: tl.constexpr,\n    EXPERTS_PER_RANK: tl.constexpr,\n    NUM_TOT_EXPERTS: tl.constexpr,\n    ELEMENT_SIZE: tl.constexpr = 2,\n    SCALE_ELEMENT_SIZE: tl.constexpr = 4,\n):\n    pid = tl.program_id(0)\n    threadidx = tid(axis=0)\n\n    exp_st = pid * EXPERTS_PER_RANK\n    exp_ed = exp_st + EXPERTS_PER_RANK\n\n    m_st = tl.load(splits_cumsum + exp_st)\n    m_ed = tl.load(splits_cumsum + exp_ed)\n    num_rows_cur_block = m_ed - m_st\n\n    src_off = m_st\n    dst_off = rank * MAX_M\n\n    split_src_ptr = splits_src + exp_st\n    off0 = exp_st + tl.arange(0, EXPERTS_PER_RANK)\n    off1 = exp_st + tl.arange(0, EXPERTS_PER_RANK) + 1\n    cumsum_sts = tl.load(splits_cumsum + off0)\n    cumsum_eds = tl.load(splits_cumsum + off1)\n    tl.store(split_src_ptr + tl.arange(0, EXPERTS_PER_RANK), cumsum_eds - cumsum_sts)\n\n    act_pos = call_count % 2\n    data_dst_ptr = data_dst + act_pos * WORLD_SIZE * MAX_M * HIDDEN + dst_off * HIDDEN\n    split_dst_ptr = splits_dst + act_pos * NUM_TOT_EXPERTS + rank * EXPERTS_PER_RANK\n    signal_ptr = signal + act_pos * WORLD_SIZE + rank\n\n    libshmem_device.putmem_nbi_block(\n        data_dst_ptr,\n        data_src + src_off * HIDDEN,\n        num_rows_cur_block * HIDDEN * ELEMENT_SIZE,\n        pid,\n    )\n    libshmem_device.putmem_nbi_block(\n        split_dst_ptr,\n        split_src_ptr,\n        EXPERTS_PER_RANK * 4,  # 现在我们使用 `int32` 来表示分割信息\n        pid,\n    )\n    if WITH_SCALE:\n        scale_dst_ptr = scale_dst + act_pos * WORLD_SIZE * MAX_M + dst_off\n        libshmem_device.putmem_signal_nbi_block(\n            scale_dst_ptr,\n            scale_src + src_off,\n            num_rows_cur_block * SCALE_ELEMENT_SIZE,\n            signal_ptr,\n            call_count,\n            
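# 下一个参数为信号操作类型：SET 表示数据送达后把对端 signal 置为 call_count\n            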
libshmem_device.NVSHMEM_SIGNAL_SET,\n            pid,\n        )\n\n    libshmem_device.fence()\n    if threadidx == 0:\n        if not WITH_SCALE:\n            libshmem_device.signal_op(\n                signal_ptr,\n                call_count,\n                libshmem_device.NVSHMEM_SIGNAL_SET,\n                pid,\n            )\n        libshmem_device.signal_wait_until(\n            signal + act_pos * WORLD_SIZE + pid,\n            libshmem_device.NVSHMEM_CMP_EQ,\n            call_count,\n        )\n```\n\n此外，用户还可以将通信部分与计算部分结合，设计出重叠内核。我们在 `python\u002Ftriton_dist\u002Fkernels` 中提供了示例实现。\n\n## 性能\nTriton-distributed 的性能可以达到与手工优化库相当甚至更好的水平。\n\n### 单节点 H800x8 上的 AllGather GEMM\n![AG-GEMM-intra-node](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FByteDance-Seed_Triton-distributed_readme_43ba98affa7e.png)\n\n### 单节点 H800x8 上的 GEMM ReduceScatter\n![GEMM-RS-intra-node](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FByteDance-Seed_Triton-distributed_readme_453a03d17e69.png)\n\n### 两节点 H800x8 上的 AllGather GEMM\n![AG-GEMM-inter-node](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FByteDance-Seed_Triton-distributed_readme_41710ed9f960.png)\n\n### 两节点 H800x8 上的 GEMM ReduceScatter\n![GEMM-RS-inter-node](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FByteDance-Seed_Triton-distributed_readme_2ed55e9c71ad.png)\n\n### 分布式 Flash-Decode 从 1 张 GPU 扩展到 32 张 GPU 的性能\n解码时的批次大小为 1（单个查询）。\n![flash-decode-inter-node](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FByteDance-Seed_Triton-distributed_readme_581eb6f2b3e4.png)\n\n### 其他平台上的性能\n[AMD GPU](docs\u002Famd-perf.md)\n\n\n## 路线图\n\n### 功能\n- [x] 发布低层原语\n- [ ] 发布高层原语\n- [x] 教程\n- [x] 预编译二进制文件\n\n### 内核\n- [x] 发布单节点 GEMM TP 重叠内核\n- [x] 发布单节点 MoE TP 重叠内核\n- [x] 发布单节点分布式 Flash-Decoding 内核\n- [ ] 发布单节点 MoE EP 重叠内核\n- [x] 发布跨节点 GEMM TP 重叠内核\n- [x] 发布跨节点 MoE TP 重叠内核\n- [x] 发布跨节点分布式 Flash-Decoding 内核\n- [x] 发布跨节点 EP 全对全通信内核（类似于 [DeepEP](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepEP)）\n- [x] 提供内核实现教程\n\n### 后端\n计算\n- [x] 支持 Nvidia SM90a\n- [x] 支持 Nvidia SM80\n- [x] 支持 Nvidia SM89\n- [x] 支持 AMD CDNA3\n\n通信\n- [x] NVLink\n- [x] IB\n- [x] PCIe\n\n### 性能\n- [x] 性能报告\n\n## 许可证\nTriton-distributed 项目采用 MIT 许可证。我们的部分代码采用 Apache-2.0 许可证：\n- `python\u002Ftriton_dist\u002Fkernels\u002Fflash_decode.py`\n\n\n## 引用\n如果您在科学出版物中使用 Triton-distributed，我们鼓励您在相关论文中添加以下参考文献（BibTeX 条目保留英文原文，以便文献工具正确解析）：\n```bibtex\n@misc{zheng2025tritondistributed,\n      title={Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler},\n      author={Size Zheng and Wenlei Bao and Qi Hou and Xuegui Zheng and Jin Fang and Chenhui Huang and Tianqi Li and Haojie Duanmu and Renze Chen and Ruifan Xu and Yifan Guo and Ningxin Zheng and Ziheng Jiang and Xinyi Di and Dongyang Wang and Jianxi Ye and Haibin Lin and Li-Wen Chang and Liqiang Lu and Yun Liang and Jidong Zhai and Xin Liu},\n      year={2025},\n      eprint={2504.19442},\n      archivePrefix={arXiv},\n      primaryClass={cs.DC},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.19442},\n}\n\n@article{zheng2025tilelink,\n  title={Tilelink: Generating efficient compute-communication overlapping kernels using tile-centric primitives},\n  author={Zheng, Size and Fang, Jin and Zheng, Xuegui and Hou, Qi and Bao, Wenlei and Zheng, Ningxin and Jiang, Ziheng and Wang, Dongyang and Ye, Jianxi and Lin, Haibin and others},\n  journal={arXiv preprint arXiv:2503.20313},\n  year={2025}\n}\n```\n\n# 关于 [字节跳动 Seed 团队](https:\u002F\u002Fteam.doubao.com\u002F)\n\n字节跳动 Seed 团队成立于 2023 年，致力于打造业界最先进的 AI 基础模型。该团队立志成为世界一流的研究团队，为科学和社会的进步做出重大贡献。\n\n# 讨论与贡献\n请使用议题或拉取请求进行讨论和贡献（参见 [CONTRIBUTING.md](CONTRIBUTING.md)）。","# Triton-distributed 快速上手指南\n\nTriton-distributed 是由字节跳动 Seed 团队开发的分布式编译器，基于 OpenAI Triton 构建。它专为**计算与通信重叠**（Compute-Communication 
Overlapping）设计，支持开发者编写高效的内核，性能可媲美高度优化的手工调优库（如 Distributed-GEMM、FLUX）。目前主要支持 NVIDIA GPU 和 AMD GPU。\n\n## 环境准备\n\n### 系统要求\n- **操作系统**: Linux (推荐 Ubuntu)\n- **GPU**: \n  - NVIDIA: 支持 SM80, SM89, SM90a 架构 (如 A100, H800, H100 等)\n  - AMD: 支持 CDNA3 架构\n- **容器环境**: 推荐使用官方 PyTorch Docker 镜像\n\n### 前置依赖\n在开始安装前，请确保已安装以下基础软件：\n- Docker (用于容器化部署)\n- NVIDIA Driver 及 CUDA Toolkit (若宿主机直接安装)\n- Python 3.12 (推荐在容器中运行)\n\n## 安装步骤\n\n推荐使用 **Docker 容器 + pip** 的方式进行安装，以避免环境冲突。\n\n### 1. 启动 PyTorch 容器\n拉取并运行包含必要驱动环境的 PyTorch 容器：\n\n```sh\ndocker run --name triton-dist --ipc=host --network=host --privileged --cap-add=SYS_ADMIN --shm-size=10g --gpus=all -itd nvcr.io\u002Fnvidia\u002Fpytorch:25.04-py3 \u002Fbin\u002Fbash\ndocker exec -it triton-dist \u002Fbin\u002Fbash\n```\n\n### 2. 安装依赖库\n在容器内安装 `cuda.core`, `nvshmem` 等关键依赖：\n\n```sh\npip3 install cuda.core==0.2.0 nvidia-nvshmem-cu12==3.3.9 Cython==0.29.24 nvshmem4py-cu12==0.1.2\npip3 install cuda-python==12.4 setuptools==69.0.0 wheel pybind11\n```\n\n### 3. 安装 Triton-distributed\n首先移除可能随 PyTorch 预装的旧版 Triton，然后安装最新版本的 Triton-distributed：\n\n```sh\n# 卸载冲突的旧版本\npip uninstall triton\npip uninstall triton_dist\nrm -rf \u002Fusr\u002Flocal\u002Flib\u002Fpython3.12\u002Fdist-packages\u002Ftriton\n\n# 安装 Triton-distributed (当前版本 v0.0.2)\nVERSION=v0.0.2\npip install https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Freleases\u002Fdownload\u002F${VERSION}\u002Ftriton_dist-3.4.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl\n```\n\n> **注意**: 如果网络访问 GitHub 较慢，可尝试配置国内 pip 镜像源（如 `pip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple ...`），但 wheel 文件链接需保持原样或下载后本地安装。\n\n## 基本使用\n\nTriton-distributed 提供了一套底层原语（Primitives），用于开发分布式的计算通信重叠内核。以下是一个**低延迟 AllToAll 通信内核**的最小化示例，展示了如何使用 `@triton_dist.jit` 装饰器和 `libshmem_device` 进行显存直接通信。\n\n```python\nimport triton\nimport triton.language as tl\nimport triton_dist\n\n@triton_dist.jit\ndef all_to_all_kernel(\n    data_src,\n    data_dst,\n    splits_src,\n    splits_dst,\n    signal,\n    splits_cumsum,\n    scale_src,\n    scale_dst,\n    rank: int,\n    call_count: int,\n    WITH_SCALE: tl.constexpr,\n    WORLD_SIZE: tl.constexpr,\n    HIDDEN: tl.constexpr,\n    MAX_M: tl.constexpr,\n    EXPERTS_PER_RANK: tl.constexpr,\n    NUM_TOT_EXPERTS: tl.constexpr,\n    ELEMENT_SIZE: tl.constexpr = 2,\n    SCALE_ELEMENT_SIZE: tl.constexpr = 4,\n):\n    pid = tl.program_id(0)\n    # 获取线程索引 (假设 tid 函数已在上下文中定义或通过 triton 原生方式获取)\n    threadidx = tid(axis=0) \n\n    exp_st = pid * EXPERTS_PER_RANK\n    exp_ed = exp_st + EXPERTS_PER_RANK\n\n    m_st = tl.load(splits_cumsum + exp_st)\n    m_ed = tl.load(splits_cumsum + exp_ed)\n    num_rows_cur_block = m_ed - m_st\n\n    src_off = m_st\n    dst_off = rank * MAX_M\n\n    # 计算分片偏移\n    split_src_ptr = splits_src + exp_st\n    off0 = exp_st + tl.arange(0, EXPERTS_PER_RANK)\n    off1 = exp_st + tl.arange(0, EXPERTS_PER_RANK) + 1\n    cumsum_sts = tl.load(splits_cumsum + off0)\n    cumsum_eds = tl.load(splits_cumsum + off1)\n    tl.store(split_src_ptr + tl.arange(0, EXPERTS_PER_RANK), cumsum_eds - cumsum_sts)\n\n    act_pos = call_count % 2\n    data_dst_ptr = data_dst + act_pos * WORLD_SIZE * MAX_M * HIDDEN + dst_off * HIDDEN\n    split_dst_ptr = splits_dst + act_pos * NUM_TOT_EXPERTS + rank * EXPERTS_PER_RANK\n    signal_ptr = signal + act_pos * WORLD_SIZE + rank\n\n    # 执行非阻塞内存复制 (Put)\n    libshmem_device.putmem_nbi_block(\n        data_dst_ptr,\n        data_src + src_off * HIDDEN,\n        num_rows_cur_block * HIDDEN * ELEMENT_SIZE,\n      
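  # 注：随后的参数 pid 兼作目标 PE，本程序块把数据写往编号为 pid 的对端 rank\n      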
  pid,\n    )\n    libshmem_device.putmem_nbi_block(\n        split_dst_ptr,\n        split_src_ptr,\n        EXPERTS_PER_RANK * 4,\n        pid,\n    )\n    \n    # 可选：处理 Scale 数据并发送信号\n    if WITH_SCALE:\n        scale_dst_ptr = scale_dst + act_pos * WORLD_SIZE * MAX_M + dst_off\n        libshmem_device.putmem_signal_nbi_block(\n            scale_dst_ptr,\n            scale_src + src_off,\n            num_rows_cur_block * SCALE_ELEMENT_SIZE,\n            signal_ptr,\n            call_count,\n            libshmem_device.NVSHMEM_SIGNAL_SET,\n            pid,\n        )\n\n    # 内存栅栏\n    libshmem_device.fence()\n    \n    # 信号同步\n    if threadidx == 0:\n        if not WITH_SCALE:\n            libshmem_device.signal_op(\n                signal_ptr,\n                call_count,\n                libshmem_device.NVSHMEM_SIGNAL_SET,\n                pid,\n            )\n        libshmem_device.signal_wait_until(\n            signal + act_pos * WORLD_SIZE + pid,\n            libshmem_device.NVSHMEM_CMP_EQ,\n            call_count,\n        )\n```\n\n**使用说明：**\n1. 将上述代码保存为 `.py` 文件。\n2. 在多卡环境下（需配置好 NCCL\u002FNVSHMEM 环境），通过多进程启动该脚本。\n3. 更多高级用法（如结合 GEMM 计算的重叠内核）请参考项目目录下的 `python\u002Ftriton_dist\u002Fkernels` 示例。","某大型模型团队正在基于 H800 集群部署混合专家（MoE）架构的大语言模型，亟需优化多卡并行推理时的通信延迟与计算效率。\n\n### 没有 Triton-distributed 时\n- **开发门槛极高**：工程师必须手动编写复杂的 CUDA\u002FC++ 代码来实现计算与通信的重叠，调试难度大且周期长。\n- **性能瓶颈明显**：传统的串行执行模式导致 GPU 在等待 AllReduce 等通信操作时空转，算力利用率不足 60%。\n- **算子适配困难**：针对特定硬件（如 H800 或 L20）优化 MoE 路由与聚合算子需要重复造轮子，难以复用现有 Triton 生态。\n- **维护成本高昂**：一旦模型结构或硬件拓扑变更，底层的分布式内核代码往往需要推倒重来，缺乏灵活性。\n\n### 使用 Triton-distributed 后\n- **开发效率飞跃**：团队直接利用熟悉的 Triton 语法编写分布式内核，自动处理底层通信逻辑，将新算子开发周期从数周缩短至数天。\n- **极致性能释放**：通过原生支持的“计算 - 通信重叠”机制，有效掩盖了网络延迟，在 MoE 场景下实现了比肩手写 CUDA 的加速效果（如官方演示的 1.33 倍提速）。\n- **跨平台无缝迁移**：一套代码即可同时适配 NVIDIA H800 和 AMD GPU，无需为不同硬件维护多套后端实现。\n- **精细化调优能力**：借助内置的核内分析器（Intra-Kernel Profiler），开发者能精准定位并行系统中的微小延迟，快速迭代优化策略。\n\nTriton-distributed 让开发者能以纯 Python 的高生产力，轻松构建出媲美顶级手工优化库的高性能分布式 AI 内核。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FByteDance-Seed_Triton-distributed_3c0e1072.png","ByteDance-Seed","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FByteDance-Seed_8c020fee.png","",null,"seed.feedback@bytedance.com","https:\u002F\u002Fseed.bytedance.com\u002F","https:\u002F\u002Fgithub.com\u002FByteDance-Seed",[83,87,91,95,99,103,107,110],{"name":84,"color":85,"percentage":86},"Python","#3572A5",91.5,{"name":88,"color":89,"percentage":90},"C++","#f34b7d",4.9,{"name":92,"color":93,"percentage":94},"Cuda","#3A4E3A",2.1,{"name":96,"color":97,"percentage":98},"Shell","#89e051",0.8,{"name":100,"color":101,"percentage":102},"CMake","#DA3434",0.5,{"name":104,"color":105,"percentage":106},"C","#555555",0.1,{"name":108,"color":109,"percentage":106},"MLIR","#5EC8DB",{"name":111,"color":112,"percentage":106},"LLVM","#185619",1400,136,"2026-04-01T12:02:51","MIT",4,"Linux","必需。主要支持 NVIDIA GPU (SM80, SM89, SM90a，如 H800, L20) 和 AMD GPU (CDNA3)。示例中使用了 H800\u002FL20。需安装 CUDA 12.4 环境 (基于 nvcr.io\u002Fnvidia\u002Fpytorch:25.04-py3 容器)。","未说明 (Docker 启动参数建议共享内存 --shm-size=10g)",{"notes":122,"python":123,"dependencies":124},"1. 官方推荐使用特定的 PyTorch Docker 容器 (nvcr.io\u002Fnvidia\u002Fpytorch:25.04-py3) 进行部署。\n2. 通信后端依赖 NVSHMEM，需确保硬件支持 NVLink、InfiniBand 或 PCIe。\n3. 安装前需卸载原有的 triton 包以避免冲突。\n4. 
该工具专为计算 - 通信重叠优化，适用于多卡分布式训练\u002F推理场景。","3.12",[125,126,127,128,129,130],"cuda.core==0.2.0","nvidia-nvshmem-cu12==3.3.9","Cython==0.29.24","nvshmem4py-cu12==0.1.2","cuda-python==12.4","PyTorch (通过 Docker 容器 nvcr.io\u002Fnvidia\u002Fpytorch:25.04-py3 提供)",[13],"2026-03-27T02:49:30.150509","2026-04-06T08:40:08.836125",[135,140,145,150,155,159],{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},13304,"构建 NVSHMEM 时遇到 CMake 错误，提示找不到 CUDA 工具包或版本不兼容怎么办？","该问题通常由 CUDA 版本不匹配引起。项目目前不支持 CUDA 12.1 构建 NVSHMEM，请升级使用 CUDA 12.4 或更高版本来解决此问题。如果之前构建失败，建议先删除 `nvshmem\u002Fbuild` 目录，然后重新运行构建命令并检查日志。","https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fissues\u002F35",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},13305,"安装后导入 Triton 时报错 \"ImportError: generic_type: type \\\"builder\\\" is already registered!\" 如何解决？","这通常是因为安装了冲突的 Triton 版本或构建环境残留导致的。请尝试拉取仓库的最新代码，并严格按照 README 中的最新安装说明重新执行安装步骤（包括卸载旧版本和清理缓存）。更新到最新提交（如 2db9f9d）后通常能解决该模块注册冲突问题。","https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fissues\u002F49",{"id":146,"question_zh":147,"answer_zh":148,"source_url":149},13306,"在双卡（TP=2）环境下运行 test_tp_attn.py 或 test_tp_mlp.py 时程序挂起（Hang）是什么原因？","这是一个已确认的 Bug。维护者指出此前未在 TP=2 及特定形状下测试相关功能，导致在双卡（如 2xH800 或 2xH20）环境下运行时出现死锁。该问题已被定位并正在修复中，请等待新版本发布或关注后续的修复提交。","https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fissues\u002F81",{"id":151,"question_zh":152,"answer_zh":153,"source_url":154},13307,"AMD 教程（10-AMD-overlapping-gemm-reduce-scatter）中 GEMM 和 ReduceScatter 是否实现了真正的重叠执行？","是的，实现了重叠。该教程的设计是通过融合 GEMM 和 Scatter 写操作来实现通信与计算的重叠：当上一个 Tile 数据进行 Scatter 通信时，下一个 Tile 可以并发执行计算。虽然最终的 Reduce 操作在当前 GPU 本地运行且无通信重叠（这与跨节点 GEMM RS 的设计一致），但核心的计算与数据传输流水线是重叠的。","https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fissues\u002F92",{"id":156,"question_zh":157,"answer_zh":158,"source_url":154},13308,"在 AMD 教程的 consumer_reduce kernel 中，为什么需要对 cur_rank 进行轮转（swizzled update）处理？","这是为了正确地从其他 Rank 对应的散射缓冲区中聚合数据。代码 `(i + rank + 1) % num_ranks` 用于计算当前迭代需要读取的源 Rank 索引，确保每个 Rank 都能按顺序收集来自其他所有 Rank 的计算结果并完成归约，即使数据在物理内存上是按 Rank 分散存储的。",{"id":160,"question_zh":161,"answer_zh":162,"source_url":149},13309,"运行测试脚本报错 \"No device id is provided via init_process_group\" 警告如何处理？","这是一个常见的 PyTorch 分布式初始化警告，表示未显式指定设备 ID。虽然程序通常会使用用户当前设置的设备继续运行，但建议在初始化进程组（init_process_group）或通过 barrier 同步时，显式传入 `device_id` 参数或在运行前通过 `torch.cuda.set_device()` 设定当前设备，以消除警告并确保多卡环境下的设备绑定正确。",[164,169,174],{"id":165,"version":166,"summary_zh":167,"released_at":168},72009,"v0.0.2-rc","添加 Ulysses SP 内核，并优化 EP 内核。","2025-09-12T08:49:54",{"id":170,"version":171,"summary_zh":172,"released_at":173},72010,"v0.0.1-rc","## 编译环境\n\n* Triton v3.4\n* NVSHMEM: v3.3.9\n\n## 变更内容\n\n- 功能：在 https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fpull\u002F93 中，由 @XG-zheng 实现了对 Megakernel（超大内核）的支持。\n- 功能：在 https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fpull\u002F85 中，由 @houqi、@XG-zheng、@KnowingNothing、@wenlei-bao 和 @preminstrel 实现了对 Qwen\u002FQwen3-235B-A22B 等端到端 MoE 模型的支持。\n- 功能：支持在 Hopper 架构上进行 GEMM + AllReduce 操作。\n- 功能：在 L20\u002FAmpere 架构上支持 GroupedGEMM + ReduceScatter。\n- 功能：默认使用 NVLS ld_reduce，并采用 .acc::f32 精度进行 BF16\u002FFP16 归约运算，以提升精度。\n- 修复：以向量化方式支持 NVLS multimem.st。\n- 修复：修复了与 cooperative_launch_grids 相关的一些死锁问题。关闭了 https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fissues\u002F81。\n- 修复：修复了 AG+GroupedGEMM 中可能导致意外内存访问的若干 Bug。\n- 优化：在 H800x8 集群上处理极小数据消息时，将 AllReduce 的一次传输延迟优化至 9 微秒。关闭了 
https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fissues\u002F57。\n- 优化：修复了 AllReduce 两次传输延迟的性能问题：直接返回对称缓冲区，以减少部分 GPU 到 GPU 的复制开销。\n- 优化：DoubleTree 实现的 AllReduce 性能显著提升，但仍不适合生产环境，需进一步优化流水线。\n- 其他：支持在不依赖 CUDA 工具包和 PyTorch 的情况下进行编译。\n- 启用 rocSHMEM 主机 API 的使用，由 @drprajap 在 https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fpull\u002F68 中实现。\n\n## 已知问题\n\n* wheel 安装包中未包含 AMD 相关支持。如需尝试 AMD 平台，请自行构建。\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed\u002Fcompare\u002Fexperimental...v0.0.1-rc","2025-08-20T01:44:40",{"id":175,"version":176,"summary_zh":177,"released_at":178},72011,"experimental","## 环境配置\n\n- 容器镜像：nvcr.io\u002Fnvidia\u002Fpytorch:25.04-py3\n- Triton v3.4（此处指 Triton 编译器，而非 Triton 推理服务器）\n- NVSHMEM4py v3.3.9\n- PyTorch 2.4 及以上版本（不使用 Dynamo，即不支持 TorchCompile）\n- CUDA 12 及以上版本\n- Python 3.12 及以上版本","2025-07-11T10:19:07"]