[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-NVIDIA--cutlass":3,"tool-NVIDIA--cutlass":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",150720,2,"2026-04-11T11:33:10",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 
万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":76,"owner_website":77,"owner_url":78,"languages":79,"stars":114,"forks":115,"last_commit_at":116,"license":117,"difficulty_score":118,"env_os":119,"env_gpu":120,"env_ram":121,"env_deps":122,"category_tags":128,"github_topics":129,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":137,"updated_at":138,"faqs":139,"releases":169},6592,"NVIDIA\u002Fcutlass","cutlass","CUDA Templates and Python DSLs for High-Performance Linear Algebra","CUTLASS 是一套专为 CUDA 平台设计的高性能线性代数计算库，核心目标是简化矩阵乘法（GEMM）及相关运算的开发与优化。它通过将复杂的并行层级分解和数据移动策略封装为可复用的模块化组件，有效解决了在 NVIDIA GPU 上手动编写底层内核代码难度大、调优繁琐且难以兼顾多种硬件架构的痛点。\n\n这套工具特别适合高性能计算工程师、深度学习框架开发者以及从事算法研究的学生和学者使用。无论是需要极致性能的底层算子开发，还是希望快速验证新算法原型的科研场景，CUTLASS 都能提供强大支持。其独特亮点在于不仅拥有成熟的 C++ 模板抽象体系，全面覆盖从 FP64 到 INT4 等多种精度及 NVIDIA 历代架构（如 Hopper、Blackwell），更在最新版本中引入了 CuTe DSL。这是一种基于 Python 的领域特定语言，让开发者无需深厚的 C++ 元编程功底，即可直观地控制线程层次与数据布局，在保持原生性能的同时大幅降低学习门槛并显著缩短编译时间，是实现 GPU 算力高效利用的得力助手。","![ALT](.\u002Fmedia\u002Fimages\u002Fgemm-hierarchy-with-epilogue-no-labels.png \"Complete CUDA GEMM decomposition\")\n# Overview\n\n# CUTLASS 4.5.0\n\n_CUTLASS 4.5.0 - March 2026_\n\nCUTLASS is a collection of abstractions for implementing high-performance matrix-matrix multiplication (GEMM)\nand related computations at all levels and scales within CUDA. It incorporates strategies for\nhierarchical decomposition and data movement. CUTLASS decomposes these \"moving parts\" into reusable, modular\nsoftware components and abstractions.\n\nPrimitives for different levels of a conceptual parallelization hierarchy can be specialized and tuned\nvia custom tiling sizes, data types, and other algorithmic policy. 
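For instance, these policies surface directly as template parameters in the C++ API. The following is a minimal illustrative sketch (using the classic 2.x device-level `cutlass::gemm::device::Gemm`; the tile shapes shown are common SM80 Tensor Core defaults, not a tuned configuration, and the alias name is ours):\n\n```cpp\n#include \"cutlass\u002Fgemm\u002Fdevice\u002Fgemm.h\"\n\n\u002F\u002F Data types, layouts, math class, target architecture, and tile sizes\n\u002F\u002F are all compile-time policy choices:\nusing GemmFp16TensorOp = cutlass::gemm::device::Gemm<\n    cutlass::half_t, cutlass::layout::ColumnMajor,  \u002F\u002F ElementA, LayoutA\n    cutlass::half_t, cutlass::layout::ColumnMajor,  \u002F\u002F ElementB, LayoutB\n    float,           cutlass::layout::ColumnMajor,  \u002F\u002F ElementC, LayoutC\n    float,                                          \u002F\u002F ElementAccumulator\n    cutlass::arch::OpClassTensorOp,                 \u002F\u002F run on Tensor Cores\n    cutlass::arch::Sm80,                            \u002F\u002F target architecture\n    cutlass::gemm::GemmShape<128, 128, 32>,         \u002F\u002F threadblock tile\n    cutlass::gemm::GemmShape<64, 64, 32>,           \u002F\u002F warp tile\n    cutlass::gemm::GemmShape<16, 8, 16>>;           \u002F\u002F MMA instruction shape\n```\n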
The resulting flexibility simplifies\ntheir use as building blocks within custom kernels and applications.\n\nCUTLASS has been providing CUDA C++ template abstractions for high-performance linear algebra since 2017, and\nthese abstractions provide extensive support for a wide range of computations including\nmixed-precision computations, specialized data-movement (async copy) and\nmultiply-accumulate abstractions for FP64, FP32, TF32, FP16, BF16,\n[FP32 emulation via tensor core instruction](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002F27_ampere_3xtf32_fast_accurate_tensorop_gemm),\n 8b floating point types (e5m2 and e4m3),\n block scaled data types (NVIDIA NVFP4 and OCP standard MXFP4, MXFP6, MXFP8),\n narrow integer types (4 and 8b signed and unsigned integers),\n and binary 1b data types (where architectures allow for the\nnative support of such data types) across NVIDIA's Volta, Turing, Ampere, Ada, Hopper, and Blackwell architectures.\n\nTo this rich ecosystem of C++ based kernel programming abstractions, CUTLASS 4 adds CUTLASS DSLs. These are Python native interfaces for writing high-performance CUDA kernels based on core CUTLASS and CuTe concepts without any performance compromises. This allows for a much smoother learning curve, orders of magnitude faster compile times, native integration with DL frameworks without writing glue code, and much more intuitive metaprogramming that does not require deep C++ expertise.\n\nOverall we envision CUTLASS DSLs as a family of domain-specific languages (DSLs). With the release of 4.0, we are releasing the first of these in CuTe DSL. This is a low-level programming model that is fully consistent with CuTe C++ abstractions — exposing core concepts such as layouts, tensors, hardware atoms, and full control over the hardware thread and data hierarchy.\n\nCuTe DSL demonstrates optimal matrix multiply and other linear algebra operations\ntargeting the programmable, high-throughput _Tensor Cores_ implemented by\nNVIDIA's Ampere, Hopper, and Blackwell architectures.\n\nWe believe it will become an indispensable tool for students, researchers, and performance\nengineers alike — flattening the learning curve of GPU programming, rapidly prototyping kernel\ndesigns, and bringing optimized solutions into production.\n\nCuTe DSL is currently in public beta and will graduate out of beta by the end of summer 2025.\n\nTo get started quickly, please refer to:\n  - [CUTLASS C++ Quick Start Guide](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fquickstart.html).\n  - [CuTe DSL Quick Start Guide](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002FpythonDSL\u002Fquick_start.html).\n\n# What's New in CUTLASS 4.5\n\n### CuTe DSL\n* Bug fixing and improvements\n  - Improved source code correlation for profiling\u002Fdebugging\n\n### CUTLASS C++\n* Add [example 95](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002F95_blackwell_gemm_green_context) to support green context SM partitioning\n  - Enables launching GEMM on a stream with partial SM allocation.\n* Fix some kernel issues:\n  - Fix l2_capacity=0 handling in Blackwell SM100\u002FSM120 kernel templates\n  - Fix CUTLASS clang build issues\n* Fix some profiler issues:\n  - Add missing reference kernels for blockwise GEMM profiler\n* Various improvements and fixes from the community and CUTLASS team. 
Thanks to everyone who submitted PRs!\n* Optimal code generation with CUDA toolkit version 13.2.\n\nNote: CUTLASS 4.x builds are known to be broken on Windows platforms for all CUDA toolkits.\nThe CUTLASS team is working on a fix.\n\n**See the [CHANGELOG](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002FCHANGELOG.html) for details of all past releases and updates.**\n\n# Performance\n\nCUTLASS primitives are very efficient. When used to construct device-wide GEMM kernels,\nthey exhibit nearly optimal utilization of peak theoretical throughput. The figure below\nshows CUTLASS 3.8's performance as a % of theoretical peak utilization\non various input and output data types when run on an NVIDIA Blackwell SM100 architecture GPU.\n\n![ALT](media\u002Fimages\u002Fcutlass-3.8-blackwell-gemm-peak-performance.svg \"\")\n\nThe two figures below show the continual CUTLASS performance improvements\non an [NVIDIA H100](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fdata-center\u002Fh100\u002F) (NVIDIA Hopper architecture) since\nCUTLASS 3.1.\nCUTLASS 3.5.1 was compiled with the [CUDA 12.5u1 Toolkit](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-downloads).\nTensor Core operations are implemented using CUDA's\n[mma](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fparallel-thread-execution\u002Findex.html#warp-level-matrix-instructions-mma) and\n[wgmma](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fparallel-thread-execution\u002Findex.html#asynchronous-warpgroup-level-matrix-instructions) instructions.\n\n![ALT](media\u002Fimages\u002Fcutlass-3.5.1-gemm-peak-performance.png \"\")\n![ALT](media\u002Fimages\u002Fcutlass-3.5.1-gemm-peak-performance-fp8.png \"\")\n\n# CuTe\n\nCUTLASS 3.0 introduced a new core library, CuTe, to describe and manipulate tensors of threads and data.\nCuTe is a collection of C++ CUDA template abstractions for\ndefining and operating on hierarchically multidimensional layouts of threads and data.\nCuTe provides `Layout` and `Tensor` objects that compactly package the type,\nshape, memory space, and layout of data, while performing the complicated indexing for the user.\nThis lets programmers focus on the logical descriptions of their algorithms while\nCuTe does the mechanical bookkeeping for them. 
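As a minimal host-side sketch (the same abstractions are used in device code; variable names here are illustrative), a `Layout` composed with a plain array yields a `Tensor`, and tiling is a layout-algebra operation:\n\n```cpp\n#include <cute\u002Ftensor.hpp>\n\nint main() {\n  using namespace cute;\n\n  \u002F\u002F A (4,8) column-major layout: shape (4,8), strides (1,4).\n  auto layout = make_layout(make_shape(Int<4>{}, Int<8>{}),\n                            make_stride(Int<1>{}, Int<4>{}));\n\n  float data[32] = {};\n  auto tensor = make_tensor(&data[0], layout);  \u002F\u002F non-owning view over data\n\n  \u002F\u002F CuTe maps the logical coordinate (i,j) to the offset i + 4*j.\n  tensor(2, 3) = 1.0f;  \u002F\u002F writes data[14]\n\n  \u002F\u002F Functional composition: carve the tensor into 2x2 tiles, producing a\n  \u002F\u002F tensor of shape ((2,2),(2,4)): intra-tile coordinate, then tile index.\n  auto tiled = zipped_divide(tensor, Shape<_2, _2>{});\n\n  \u002F\u002F Element (0,1) of tile (1,1) is logical coordinate (2,3):\n  float x = tiled(make_coord(0, 1), make_coord(1, 1));  \u002F\u002F == 1.0f\n  return x == 1.0f ? 0 : 1;\n}\n```\n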
With these tools, we can quickly design,\nimplement, and modify all dense linear algebra operations.\n\nThe core abstractions of CuTe are hierarchically multidimensional layouts\nwhich can be composed with data arrays to represent tensors.\nThe representation of layouts is powerful enough to represent nearly\neverything we need to implement efficient dense linear algebra.\nLayouts can also be combined and manipulated via functional composition, on which we build a large set of common operations such as tiling and partitioning.\n\nCUTLASS 3.0 and beyond adopts CuTe throughout the GEMM hierarchy in its templates.\nThis greatly simplifies the design and improves code composability and readability.\nMore documentation specific to CuTe can be found in its\n[dedicated documentation directory](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fcute\u002F00_quickstart.html).\n\n# Compatibility\n\nMinimum requirements:\n\n- Architecture: Volta (compute capability 7.0)\n- Compiler: Must support at least C++17\n- CUDA Toolkit version: 11.4\n\nCUTLASS requires a C++17 host compiler and\nperforms best when built with the [**CUDA 12.8 Toolkit**](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-downloads).\nIt is also compatible with CUDA 11.4, CUDA 11.5, CUDA 11.6, CUDA 11.7, CUDA 11.8, and all other CUDA 12.x versions.\n\n## Operating Systems\n\nWe have tested the following environments.\n\n|**Operating System** | **Compiler** |\n|-----------------|----------|\n| Ubuntu 18.04 | GCC 7.5.0  |\n| Ubuntu 20.04 | GCC 10.3.0 |\n| Ubuntu 22.04 | GCC 11.2.0 |\n\nNote: GCC 8.5.0 has known regressions regarding fold expressions and overloaded operators. Using GCC 7.5.0 or (preferred) GCC >= 9 is recommended.\n\nNote: CUTLASS 3.x builds are known to be broken on Windows platforms for all CUDA toolkits.\nThe CUTLASS team is working on a fix.\n\n## Hardware\n\nCUTLASS runs successfully on the following NVIDIA GPUs, and it is expected to be efficient on Volta, Turing, Ampere, Ada, Hopper, and Blackwell architecture based NVIDIA GPUs.\n\n|**GPU**|**CUDA Compute Capability**|**Minimum CUDA Toolkit Required by CUTLASS-3**|\n|---|---|---|\n|NVIDIA V100 Tensor Core GPU            |7.0|11.4|\n|NVIDIA TitanV                          |7.0|11.4|\n|NVIDIA GeForce RTX 20x0 series         |7.5|11.4|\n|NVIDIA T4                              |7.5|11.4|\n|NVIDIA A100 Tensor Core GPU            |8.0|11.4|\n|NVIDIA A10                             |8.6|11.4|\n|NVIDIA GeForce RTX 30x0 series         |8.6|11.4|\n|NVIDIA GeForce RTX 40x0 series         |8.9|11.8|\n|NVIDIA L40                             |8.9|11.8|\n|NVIDIA H100 Tensor Core GPU            |9.0|11.8|\n|NVIDIA H200 Tensor Core GPU            |9.0|11.8|\n|NVIDIA B200 Tensor Core GPU            |10.0|12.8|\n|NVIDIA B300 Tensor Core GPU            |10.3|13.0|\n|NVIDIA DRIVE Thor                      |11.0|13.0|\n|NVIDIA GeForce RTX 50x0 series         |12.0|12.8|\n|NVIDIA DGX Spark                       |12.1|13.0|\n\n## Target Architecture\n\nIn general, PTX code generated for one target architecture can be run on future architectures\n(i.e., it is forward compatible).\nHowever, CUDA 12.0 introduced the concept of \"architecture-accelerated features\" whose\nPTX does not have forward compatibility guarantees.\nSeveral Hopper and Blackwell PTX instructions fall under this category of\narchitecture-accelerated features, and thus require an `sm_90a` or `sm_100a` target architecture\n(note the \"a\" appended). 
For more details on this and other architecture-accelerated instructions,\nplease refer to the [CUDA Documentation](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcuda-c-programming-guide\u002Findex.html#feature-availability).\n\nThe target architecture information is passed on to CUTLASS via the cmake flag\n`CUTLASS_NVCC_ARCHS`. In order to maximize performance on Hopper GH100,\nusers are required to build CUTLASS with `90a` as the target architecture.\nIf a user accidentally builds a kernel which uses SM90a features\n(e.g. Hopper Tensor Core Instructions), using the SM90 target\n(note the lack of \"a\"), with either CUDA Toolkit 12 or 11.8,\nthe kernel is expected to fail with a runtime error.\n\n```\ncmake .. -DCUTLASS_NVCC_ARCHS=\"90a\"\n```\nOr\n\n```\ncmake .. -DCUTLASS_NVCC_ARCHS=\"100a\"\n```\n\nNote: The NVIDIA Blackwell SM100 architecture used in the datacenter\nproducts has a different compute capability than the one underpinning\nNVIDIA Blackwell GeForce RTX 50 series GPUs (SM120). As a result, kernels\ncompiled for Blackwell SM100 architecture with arch conditional features\n(using `sm100a`) are not compatible with RTX 50 series GPUs.\n\nPlease refer to the [functionality documentation](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Ffunctionality.html)\nfor details on which kernels require which target architectures.\n\n# Documentation\n\nCUTLASS is described in the following documents and the accompanying\n[Doxygen documentation](https:\u002F\u002Fnvidia.github.io\u002Fcutlass).\n\n- [Quick Start Guide](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fquickstart.html) - basics of building and running CUTLASS\n- [Functionality](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Ffunctionality.html) - summarizes functionality available in CUTLASS\n- [Efficient GEMM in CUDA](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fefficient_gemm.html) - describes how GEMM kernels may be implemented efficiently in CUDA\n- [CUTLASS 3.x Design](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fcutlass_3x_design.html) - describes the CUTLASS 3.x design, its benefits, and how CuTe enables us to write much more composable components\n- [GEMM API 3.x](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fgemm_api_3x.html) - describes the CUTLASS 3.x GEMM model and C++ template concepts\n- [GEMM API 2.x](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fgemm_api.html) - describes the CUTLASS 2.x GEMM model and C++ template concepts\n- [Implicit GEMM Convolution](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fimplicit_gemm_convolution.html) - describes 2-D and 3-D convolution in CUTLASS\n- [Code Organization](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fcode_organization.html) - describes the organization and contents of the CUTLASS project\n- [Terminology](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fterminology.html) - describes terms used in the code\n- [Programming Guidelines](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fprogramming_guidelines.html) - guidelines for writing efficient modern CUDA 
C++\n- [Fundamental types](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Ffundamental_types.html) - describes basic C++ classes used in CUTLASS to represent numeric quantities and arrays\n- [Layouts](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Flayout.html) - describes layouts of matrices and tensors in memory\n- [Tile Iterators](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Ftile_iterator_concept.html) - describes C++ concepts for iterating over tiles of matrices in memory\n- [CUTLASS Profiler](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fprofiler.html) - command-line driven profiling application\n- [CUTLASS Utilities](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Futilities.html) - additional templates used to facilitate rapid development\n- [Dependent kernel launch](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fdependent_kernel_launch.html) - describes a new feature in Hopper which allows overlapping dependent\nkernels in the same stream, and how it is used in CUTLASS.\n\n# Resources\nWe have also described the structure of an efficient GEMM in our talk at the\n[GPU Technology Conference 2018](http:\u002F\u002Fon-demand.gputechconf.com\u002Fgtc\u002F2018\u002Fpresentation\u002Fs8854-cutlass-software-primitives-for-dense-linear-algebra-at-all-levels-and-scales-within-cuda.pdf).\n\n- [CUTLASS: Software Primitives for Dense Linear Algebra at All Levels and Scales within CUDA](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcsiliconvalley2018-s8854\u002F)\n- [Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit on NVIDIA A100](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcsj20-s21745\u002F)\n- [Accelerating Convolution with Tensor Cores in CUTLASS](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcspring21-s31883\u002F)\n- [Accelerating Backward Data Gradient by Increasing Tensor Core Utilization in CUTLASS](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcspring22-s41996\u002F)\n- [CUTLASS: Python API, Enhancements, and NVIDIA Hopper](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcfall22-a41131\u002F)\n\n# Building CUTLASS\n\nCUTLASS is a header-only template library and does not need to be built to be used by other\nprojects. Client applications should target CUTLASS's `include\u002F` directory in their include\npaths.\n\nCUTLASS unit tests, examples, and utilities can be built with CMake.\nThe minimum version of CMake is given in the [Quickstart guide](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fquickstart.html).\nMake sure the `CUDACXX` environment variable points to NVCC in the CUDA Toolkit installed\non your system.\n\n```bash\n$ export CUDACXX=${CUDA_INSTALL_PATH}\u002Fbin\u002Fnvcc\n```\n\nCreate a build directory within the CUTLASS project, then run CMake. By default, CUTLASS will build kernels\nfor CUDA architecture versions 5.0, 6.0, 6.1, 7.0, 7.5, 8.0, 8.6, 8.9, and 9.0.\nTo reduce compile time, you can specify\nthe architectures to build CUTLASS for by changing the CMake configuration setting\n`CUTLASS_NVCC_ARCHS`.\n\n```bash\n$ mkdir build && cd build\n\n$ cmake .. 
-DCUTLASS_NVCC_ARCHS=80               # compiles for NVIDIA's Ampere Architecture\n```\n\nFrom the `build\u002F` directory, compile and run the CUTLASS unit tests by building the target `test_unit` with make.\n\nThe unit tests are organized as several binaries mirroring the top-level namespaces of CUTLASS,\nand they may be executed in parallel via make's `-j` command line argument.\n\n```bash\n$ make test_unit -j\n...\n...\n...\n[----------] Global test environment tear-down\n[==========] 946 tests from 57 test cases ran. (10812 ms total)\n[  PASSED  ] 946 tests.\n```\n\nAll tests should pass on supported platforms, though the exact number of tests may vary over time.\n\n\n# Project Structure\n\nCUTLASS is arranged as a header-only library along with Utilities, Tools, Examples, and unit tests.\n[Doxygen documentation](https:\u002F\u002Fnvidia.github.io\u002Fcutlass) provides a complete list of files, classes,\nand template concepts defined in the CUTLASS project.\n\nA detailed explanation of the source code organization may be found in the\n[CUTLASS documentation](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fcode_organization.html), but several main components are summarized below.\n\n## CUTLASS Template Library\n\n```\ninclude\u002F                     # client applications should target this directory in their build's include paths\n\n  cutlass\u002F                   # CUDA Templates for Linear Algebra Subroutines and Solvers - headers only\n\n    arch\u002F                    # direct exposure of architecture features (including instruction-level GEMMs)\n\n    conv\u002F                    # code specialized for convolution\n\n    epilogue\u002F                # code specialized for the epilogue of gemm\u002Fconvolution\n\n    gemm\u002F                    # code specialized for general matrix product computations\n\n    layout\u002F                  # layout definitions for matrices, tensors, and other mathematical objects in memory\n\n    platform\u002F                # CUDA-capable Standard Library components\n\n    reduction\u002F               # bandwidth-limited reduction kernels that do not fit the \"gemm\" model\n\n    thread\u002F                  # simt code that can be performed within a CUDA thread\n\n    transform\u002F               # code specialized for layout, type, and domain transformations\n\n    *                        # core vocabulary types, containers, and basic numeric operations\n\n  cute\u002F                      # CuTe Layout, layout algebra, MMA\u002FCopy atoms, tiled MMA\u002FCopy\n\n    algorithm\u002F               # Definitions of core operations such as copy, gemm, and operations on cute::tuples\n\n    arch\u002F                    # Bare bones PTX wrapper structs for copy and math instructions\n\n    atom\u002F                    # Meta-information either link to or built from arch\u002F operators\n\n      mma_atom.hpp           # cute::Mma_Atom and cute::TiledMma\n\n      copy_atom.hpp          # cute::Copy_Atom and cute::TiledCopy\n\n      *sm*.hpp               # Arch specific meta-information for copy and math operations\n\n    *                        # Core library types such as Shape, Stride, Layout, Tensor, and associated operations\n\n```\n\n### CUTLASS SDK Examples\n\n[CUTLASS SDK examples](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples) apply CUTLASS templates to implement basic computations.\n\n### Tools\n\n```\ntools\u002F\n  library\u002F          
         # CUTLASS Instance Library - contains instantiations of all supported CUTLASS templates\n    include\u002F\n      cutlass\u002F\n        library\u002F\n\n  profiler\u002F                  # CUTLASS Profiler         - command-line utility for executing operations in the\n                             #                            CUTLASS Library\n\n  util\u002F                      # CUTLASS Utilities        - contains numerous helper classes for\n    include\u002F                 #                            managing tensors in device memory, reference\n      cutlass\u002F               #                            implementations for GEMM, random initialization\n        util\u002F                #                            of tensors, and I\u002FO.\n```\n\n### Test\n\nThe `test\u002Funit\u002F` directory consists of unit tests implemented with Google Test that demonstrate\nbasic usage of Core API components and complete tests of the CUTLASS GEMM computations.\n\nInstructions for building and running the Unit tests are described in the [Quickstart guide](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fquickstart.html).\n\n# Performance Profiling\n\nThe `tools\u002Fprofiler\u002F` directory contains a command-line utility for launching each of the GEMM kernels.\nIt can be built as follows:\n\n```bash\n$ make cutlass_profiler -j16\n```\n## Building all GEMM and Convolution kernels (_long_ build times)\n\nBy default, only one tile size is instantiated for each data type, math instruction, and layout.\nTo instantiate all, set the following CMake define when running CMake from an empty `build\u002F` directory.\nBeware, this results in *tens of thousands* of kernels and long build times.\nIt would also result in a large binary size and, on some platforms, cause the linker to fail when building the library.\nTherefore, it's highly recommended to generate only a subset of kernels as demonstrated in the sub-section below.\n```bash\n$ cmake .. -DCUTLASS_NVCC_ARCHS=90a -DCUTLASS_LIBRARY_KERNELS=all\n...\n$ make cutlass_profiler -j16\n```\n\n## Building a subset of GEMM and Convolution kernels (_reduced_ build times)\n\nTo compile strictly one kernel or a small set of kernels, a comma-delimited list of kernel names with\nwildcard characters may be used to reduce the set of kernels. The following examples show building exactly one\nor a subset of kernels for the NVIDIA Ampere and Turing architectures:\n\n### Building a subset of Tensor Core GEMM kernels\n\nTo compile a subset of Tensor Core GEMM kernels with FP32 accumulation and FP16 input targeting NVIDIA Ampere and Turing architecture,\nuse the below cmake command line:\n```bash\n$ cmake .. 
-DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8\n...\n$ make cutlass_profiler -j16\n```\n\nExample command line for profiling a subset of Tensor Core GEMM kernels is as follows:\n```bash\n.\u002Ftools\u002Fprofiler\u002Fcutlass_profiler --kernels=cutlass_tensorop_s*gemm_f16_*_nt_align8 --m=3456 --n=4096 --k=4096\n\n...\n=============================\n  Problem ID: 1\n\n        Provider: CUTLASS\n   OperationKind: gemm\n       Operation: cutlass_tensorop_s1688gemm_f16_256x128_32x2_nt_align8\n\n          Status: Success\n    Verification: ON\n     Disposition: Passed\n\nreference_device: Passed\n          cuBLAS: Passed\n\n       Arguments: --gemm_kind=universal --m=3456 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f32:column --alpha=1  \\\n                  --beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=f32 --cta_m=256 --cta_n=128  \\\n                  --cta_k=32 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=75  \\\n                  --max_cc=1024\n\n           Bytes: 118489088  bytes\n           FLOPs: 115992428544  flops\n\n         Runtime: 1.55948  ms\n          Memory: 70.7616 GiB\u002Fs\n\n            Math: 74378.8 GFLOP\u002Fs\n\n\n\n=============================\n...\n```\n\n### Building one CUDA Core GEMM kernel\n\nTo compile one SGEMM kernel targeting NVIDIA Ampere and Turing architecture, use the below cmake command line:\n```bash\n$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sgemm_128x128_8x2_nn_align1\n...\n$ make cutlass_profiler -j16\n```\n\nExample command line for profiling single SGEMM CUDA kernel is as follows:\n```bash\n$ .\u002Ftools\u002Fprofiler\u002Fcutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096\n\n=============================\n  Problem ID: 1\n\n        Provider: CUTLASS\n   OperationKind: gemm\n       Operation: cutlass_simt_sgemm_128x128_8x2_nn_align1\n\n          Status: Success\n    Verification: ON\n     Disposition: Passed\n\n          cuBLAS: Passed\n\n       Arguments: --m=3456 --n=4096 --k=4096 --A=f32:column --B=f32:column --C=f32:column --alpha=1 --beta=0 --split_k_slices=1  \\\n                  --batch_count=1 --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4  \\\n                  --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024\n\n           Bytes: 180355072  bytes\n           FLOPs: 115992428544  flops\n\n         Runtime: 6.73655  ms\n          Memory: 24.934 GiB\u002Fs\n\n            Math: 17218.4 GFLOP\u002Fs\n\n=============================\n```\n\n### Building a subset of Tensor Core Convolution kernels\n\nTo compile a subset of Tensor core convolution kernels implementing forward propagation (fprop) with FP32 accumulation\nand FP16 input targeting NVIDIA Ampere and Turing architecture, use the below cmake command line:\n```bash\n$ cmake .. 
-DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*fprop_optimized_f16\n...\n$ make cutlass_profiler -j16\n```\n\nExample command line for profiling a subset of Tensor Core convolution kernels is as follows:\n\n```bash\n$ .\u002Ftools\u002Fprofiler\u002Fcutlass_profiler --kernels=cutlass_tensorop_s*fprop_optimized_f16 --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3\n\n...\n=============================\n  Problem ID: 1\n\n        Provider: CUTLASS\n   OperationKind: conv2d\n       Operation: cutlass_tensorop_s16816fprop_optimized_f16_128x128_32x5_nhwc\n\n          Status: Success\n    Verification: ON\n     Disposition: Passed\n\nreference_device: Passed\n\n       Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1  \\\n                  --stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f16:nhwc --Filter=f16:nhwc --Output=f32:nhwc  \\\n                  --conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1  \\\n                  --eq_gemm_provider=none --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=128 --cta_k=32 --stages=5  \\\n                  --warps_m=2 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024\n\n           Bytes: 1130659840  bytes\n           FLOPs: 118482796544  flops\n\n         Runtime: 0.711496  ms\n          Memory: 1479.99 GiB\u002Fs\n\n            Math: 166526 GFLOP\u002Fs\n\n=============================\n...\n```\n\n\n### Building one Convolution CUDA kernel\n\nTo compile and run one CUDA Core convolution kernel implementing forward propagation (fprop) with F32 accumulation\nand FP32 input targeting NVIDIA Ampere and Turing architecture, use the below cmake command line:\n```bash\n$ cmake .. 
-DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc\n...\n$ make cutlass_profiler -j16\n```\n\nExample command line for profiling one CUDA Core convolution kernel:\n\n```bash\n$ .\u002Ftools\u002Fprofiler\u002Fcutlass_profiler --kernels=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3\n\n\n=============================\n  Problem ID: 1\n\n        Provider: CUTLASS\n   OperationKind: conv2d\n       Operation: cutlass_simt_sfprop_optimized_128x128_8x2_nhwc\n\n          Status: Success\n    Verification: ON\n     Disposition: Passed\n\nreference_device: Passed\n\n       Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1  \\\n                  --stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f32:nhwc --Filter=f32:nhwc --Output=f32:nhwc  \\\n                  --conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1  \\\n                  --eq_gemm_provider=none --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4  \\\n                  --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024\n\n           Bytes: 2055798784  bytes\n           FLOPs: 118482796544  flops\n\n         Runtime: 7.34266  ms\n          Memory: 260.752 GiB\u002Fs\n\n            Math: 16136.2 GFLOP\u002Fs\n\n\n=============================\n\n```\n\n## More Details on Compiling CUTLASS Kernels and CUTLASS Profiler\n- Please follow the links for more CMake examples on selectively compiling CUTLASS kernels:\n  - [GEMM CMake Examples](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fquickstart.html#gemm-cmake-examples)\n  - [Implicit GEMM convolution CMake Examples](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fquickstart.html#convolution-cmake-examples)\n- [Further details about the CUTLASS Profiler are described here.](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fprofiler.html)\n\n\n# About\n\nCUTLASS is released by NVIDIA Corporation as Open Source software under the\n[3-clause \"New\" BSD license](LICENSE.txt).\n\n# Contributors\n\nThe official list of CUTLASS developers and contributors is available here: [CONTRIBUTORS](CONTRIBUTORS.md).\n\n# Copyright\n\nCopyright (c) 2017 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\nSPDX-License-Identifier: BSD-3-Clause\n\n```\n  Redistribution and use in source and binary forms, with or without\n  modification, are permitted provided that the following conditions are met:\n\n  1. Redistributions of source code must retain the above copyright notice, this\n  list of conditions and the following disclaimer.\n\n  2. Redistributions in binary form must reproduce the above copyright notice,\n  this list of conditions and the following disclaimer in the documentation\n  and\u002For other materials provided with the distribution.\n\n  3. 
Neither the name of the copyright holder nor the names of its\n  contributors may be used to endorse or promote products derived from\n  this software without specific prior written permission.\n\n  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\"\n  AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE\n  IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE\n  DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE\n  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL\n  DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR\n  SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER\n  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,\n  OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE\n  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\n```\n","![ALT](.\u002Fmedia\u002Fimages\u002Fgemm-hierarchy-with-epilogue-no-labels.png \"完整的CUDA GEMM分解\")\n# 概述\n\n# CUTLASS 4.5.0\n\n_CUTLASS 4.5.0 - 2026年3月_\n\nCUTLASS是一套用于在CUDA中实现高性能矩阵乘法（GEMM）及相关计算的抽象库，覆盖了所有层次和规模。它结合了分层分解和数据移动策略，将这些“可移动部分”拆解为可重用、模块化的软件组件和抽象。\n\n针对概念性并行化层次中的不同层级，可以通过自定义分块大小、数据类型及其他算法策略进行专门化和调优。这种灵活性使得它们能够更简便地作为自定义内核和应用程序中的构建模块使用。\n\n自2017年以来，CUTLASS一直提供用于高性能线性代数的CUDA C++模板抽象，并广泛支持多种计算任务，包括混合精度计算、专用数据移动（异步复制）以及适用于FP64、FP32、TF32、FP16、BF16等的乘加（multiply-accumulate）抽象；还支持通过张量核心指令实现的FP32模拟计算[参见GitHub仓库：NVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002F27_ampere_3xtf32_fast_accurate_tensorop_gemm]；此外，还支持8位浮点类型（e5m2和e4m3）、块缩放数据类型（如NVIDIA NVFP4及OCP标准MXFP4、MXFP6、MXFP8）、窄整数类型（4位和8位有符号与无符号整数）以及二进制1位数据类型（在架构原生支持的情况下），上述功能均能在NVIDIA的Volta、Turing、Ampere、Ada、Hopper和Blackwell架构上运行。\n\n为了进一步丰富这一基于C++的内核编程抽象生态系统，CUTLASS 4引入了CUTLASS DSL。这是一种Python原生接口，允许开发者基于CUTLASS和CuTe的核心概念编写高性能CUDA内核，且不会带来任何性能损失。这不仅显著降低了学习曲线，还将编译时间缩短了多个数量级，实现了与深度学习框架的原生集成而无需编写胶水代码，同时提供了更加直观的元编程方式，无需深厚的C++专业知识。\n\n总体而言，我们设想CUTLASS DSL将成为一系列领域特定语言（DSL）。随着4.0版本的发布，我们首先推出了CuTe DSL。这是一种低层次编程模型，完全兼容CuTe C++抽象——暴露了布局、张量、硬件原子等核心概念，并对硬件线程和数据层次结构拥有完全控制权。\n\nCuTe DSL展示了针对NVIDIA Ampere、Hopper和Blackwell架构所实现的可编程高吞吐量张量核心的最佳矩阵乘法和其他线性代数操作。\n\n我们相信，它将成为学生、研究人员和性能工程师不可或缺的工具——帮助他们降低GPU编程的学习难度，快速原型化内核设计，并将优化后的解决方案投入生产。\n\nCuTe DSL目前处于公开测试阶段，预计将于2025年夏季末结束测试并正式发布。\n\n欲快速入门，请参考：\n  - [CUTLASS C++快速入门指南](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fquickstart.html)。\n  - [CuTe DSL快速入门指南](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002FpythonDSL\u002Fquick_start.html)。\n\n# CUTLASS 4.5的新特性\n\n### CuTe DSL\n* 错误修复与改进\n  - 改进了源代码与性能分析\u002F调试之间的关联性\n\n### CUTLASS C++\n* 添加示例95，以支持绿色上下文SM分区\n  - 允许在部分SM分配的流上启动GEMM。\n* 修复了一些内核问题：\n  - 修复了Blackwell SM100\u002FSM120内核模板中l2_capacity=0的处理问题\n  - 修复了CUTLASS的Clang构建问题\n* 修复了一些性能分析工具的问题：\n  - 为分块GEMM性能分析工具添加了缺失的参考内核\n* 来自社区和CUTLASS团队的各种改进与修复。感谢所有提交PR的贡献者！\n* 使用CUDA工具包13.2版本生成最优代码。\n\n注意：已知CUTLASS 4.x版本在Windows平台上无法正常构建，无论使用哪个CUDA工具包。CUTLASS团队正在努力修复此问题。\n\n**有关所有过往版本及更新的详细信息，请参阅[CHANGELOG](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002FCHANGELOG.html)。**\n\n# 性能\n\nCUTLASS提供的原语非常高效。当用于构建设备级别的GEMM内核时，其利用率几乎可以达到理论峰值吞吐量。下图显示了CUTLASS 3.8在NVIDIA Blackwell SM100架构GPU上运行时，针对不同输入输出数据类型的理论峰值利用率百分比。\n\n![ALT](media\u002Fimages\u002Fcutlass-3.8-blackwell-gemm-peak-performance.svg \"\")\n\n以下两张图展示了自CUTLASS 3.1以来，在[NVIDIA H100](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fdata-center\u002Fh100\u002F)（NVIDIA Hopper架构）上CUTLASS性能的持续提升。CUTLASS 3.5.1是使用[CUDA 
12.5u1工具包](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-downloads)编译的。张量核心操作是通过CUDA的`mma`和`wgmma`指令实现的。\n\n![ALT](media\u002Fimages\u002Fcutlass-3.5.1-gemm-peak-performance.png \"\")\n![ALT](media\u002Fimages\u002Fcutlass-3.5.1-gemm-peak-performance-fp8.png \"\")\n\n# CuTe\n\nCUTLASS 3.0引入了一个新的核心库CuTe，用于描述和操作线程与数据的张量。CuTe是一组C++ CUDA模板抽象，用于定义和操作线程与数据的分层多维布局。CuTe提供了`Layout`和`Tensor`对象，能够紧凑地封装数据的类型、形状、内存空间和布局，同时为用户自动完成复杂的索引计算。这样一来，程序员只需关注算法的逻辑描述，而繁琐的底层管理工作则由CuTe代劳。借助这些工具，我们可以快速设计、实现和修改所有的密集线性代数运算。\n\nCuTe的核心抽象是分层多维布局，它可以与数据数组组合起来表示张量。这种布局表示能力强大，几乎可以涵盖实现高效密集线性代数所需的全部内容。布局还可以通过函数式组合进行合并与操作，我们在此基础上构建了一整套常用的操作，例如分块和划分。\n\n从CUTLASS 3.0开始，其模板中的GEMM层次结构全面采用了CuTe。这大大简化了设计过程，并提高了代码的可组合性和可读性。更多关于CuTe的文档可以在其[专用文档目录](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fcute\u002F00_quickstart.html)中找到。\n\n# 兼容性\n\n最低要求：\n\n- 架构：Volta（计算能力 7.0）\n- 编译器：必须支持至少 C++17\n- CUDA 工具包版本：11.4\n\nCUTLASS 需要一个支持 C++17 的主机编译器，并且在使用 [**CUDA 12.8 工具包**](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-downloads) 构建时表现最佳。它也兼容 CUDA 11.4、CUDA 11.5、CUDA 11.6、CUDA 11.7、CUDA 11.8，以及所有其他 CUDA 12.x 版本。\n\n## 操作系统\n\n我们已测试了以下环境。\n\n|**操作系统** | **编译器** |\n|-----------------|----------|\n| Ubuntu 18.04 | GCC 7.5.0  |\n| Ubuntu 20.04 | GCC 10.3.0 |\n| Ubuntu 22.04 | GCC 11.2.0 |\n\n注意：GCC 8.5.0 在折叠表达式和重载运算符方面存在已知的回归问题。建议使用 GCC 7.5.0 或（更优）GCC >= 9。\n\n注意：已知 CUTLASS 3.x 在 Windows 平台上无法构建，无论使用哪个 CUDA 工具包。CUTLASS 团队正在努力修复此问题。\n\n## 硬件\n\nCUTLASS 可以在以下 NVIDIA GPU 上成功运行，并且预计在基于 Volta、Turing、Ampere、Ada 和 Hopper 架构的 NVIDIA GPU 上具有高效性能。\n\n|**GPU**|**CUDA 计算能力**|**CUTLASS-3 所需的最低 CUDA 工具包版本**|\n|---|---|---|\n|NVIDIA V100 Tensor Core GPU            |7.0|11.4|\n|NVIDIA TitanV                          |7.0|11.4|\n|NVIDIA GeForce RTX 20x0 系列         |7.5|11.4|\n|NVIDIA T4                              |7.5|11.4|\n|NVIDIA A100 Tensor Core GPU            |8.0|11.4|\n|NVIDIA A10                             |8.6|11.4|\n|NVIDIA GeForce RTX 30x0 系列         |8.6|11.4|\n|NVIDIA GeForce RTX 40x0 系列         |8.9|11.8|\n|NVIDIA L40                             |8.9|11.8|\n|NVIDIA H100 Tensor Core GPU            |9.0|11.8|\n|NVIDIA H200 Tensor Core GPU            |9.0|11.8|\n|NVIDIA B200 Tensor Core GPU            |10.0|12.8|\n|NVIDIA B300 Tensor Core GPU            |10.3|13.0|\n|NVIDIA DRIVE Thor                      |11.0|13.0|\n|NVIDIA GeForce RTX 50x0 系列         |12.0|12.8|\n|NVIDIA DGX Spark                       |12.1|13.0|\n\n## 目标架构\n\n一般来说，为某一目标架构生成的 PTX 代码可以在未来的架构上运行（即向前兼容）。然而，CUDA 12.0 引入了“架构加速特性”的概念，其 PTX 不具备向前兼容性保证。Hopper 和 Blackwell 的一些 PTX 指令属于这一类架构加速特性，因此需要指定 `sm_90a` 或 `sm100a` 作为目标架构（注意末尾的“a”）。有关此类及其他架构加速指令的详细信息，请参阅 [CUDA 文档](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcuda-c-programming-guide\u002Findex.html#feature-availability)。\n\n目标架构信息通过 CMake 标志 `CUTLASS_NVCC_ARCHS` 传递给 CUTLASS。为了在 Hopper GH100 上实现最佳性能，用户需要将目标架构设置为 `90a` 来构建 CUTLASS。如果用户不小心构建了一个使用 SM90a 特性（例如 Hopper Tensor Core 指令）的内核，并使用 SM90 目标架构（缺少“a”），无论使用 CUDA 工具包 12 还是 11.8，该内核都可能会因运行时错误而失败。\n\n```\ncmake .. -DCUTLASS_NVCC_ARCHS=\"90a\"\n```\n或者\n\n```\ncmake .. 
-DCUTLASS_NVCC_ARCHS=\"100a\"\n```\n\n注意：数据中心产品中使用的 NVIDIA Blackwell SM100 架构与支撑 NVIDIA Blackwell GeForce RTX 50 系列 GPU 的架构（SM120）具有不同的计算能力。因此，为 Blackwell SM100 架构编译并包含架构条件特性的内核（使用 `sm100a`）与 RTX 50 系列 GPU 不兼容。\n\n请参阅 [功能文档](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Ffunctionality.html)，了解哪些内核需要哪些目标架构的详细信息。\n\n# 文档\n\nCUTLASS 的相关信息记录在以下文档中，并附有 [Doxygen 文档](https:\u002F\u002Fnvidia.github.io\u002Fcutlass)。\n\n- [快速入门指南](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fquickstart.html) - CUTLASS 的构建和运行基础\n- [功能概述](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Ffunctionality.html) - 总结了 CUTLASS 提供的功能\n- [CUDA 中高效的 GEMM](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fefficient_gemm.html) - 描述了如何在 CUDA 中高效地实现 GEMM 内核\n- [CUTLASS 3.x 设计](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fcutlass_3x_design.html) - 介绍了 CUTLASS 3.x 的设计、优势，以及 CuTe 如何使我们能够编写更具组合性的组件\n- [GEMM API 3.x](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fgemm_api_3x.html) - 描述了 CUTLASS 3.x 的 GEMM 模型和 C++ 模板概念\n- [GEMM API 2.x](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fgemm_api.html) - 描述了 CUTLASS 2.x 的 GEMM 模型和 C++ 模板概念\n- [隐式 GEMM 卷积](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fimplicit_gemm_convolution.html) - 描述了 CUTLASS 中的 2D 和 3D 卷积\n- [代码组织](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fcode_organization.html) - 描述了 CUTLASS 项目的组织结构和内容\n- [术语](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fterminology.html) - 解释了代码中使用的术语\n- [编程指南](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fprogramming_guidelines.html) - 提供了编写高效现代 CUDA C++ 的指导原则\n- [基础类型](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Ffundamental_types.html) - 描述了 CUTLASS 中用于表示数值和数组的基本 C++ 类\n- [布局](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Flayout.html) - 描述了矩阵和张量在内存中的布局\n- [分块迭代器](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Ftile_iterator_concept.html) - 描述了用于在内存中遍历矩阵分块的 C++ 概念\n- [CUTLASS 性能分析工具](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fprofiler.html) - 基于命令行的性能分析应用程序\n- [CUTLASS 实用工具](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Futilities.html) - 提供了额外的模板，以促进快速开发\n- [依赖内核启动](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fdependent_kernel_launch.html) - 描述了 Hopper 中的一项新特性，允许在同一流中重叠执行依赖内核，以及它在 CUTLASS 中的应用。\n\n# 资源\n我们还在2018年GPU技术大会上发表的演讲中描述了高效GEMM的结构：\n[GPU技术大会2018](http:\u002F\u002Fon-demand.gputechconf.com\u002Fgtc\u002F2018\u002Fpresentation\u002Fs8854-cutlass-software-primitives-for-dense-linear-algebra-at-all-levels-and-scales-within-cuda.pdf)。\n\n- [CUTLASS：CUDA中各层次和规模的密集线性代数软件原语](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcsiliconvalley2018-s8854\u002F)\n- [开发CUDA内核以将Tensor Cores推至NVIDIA A100的极限](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcsj20-s21745\u002F)\n- [使用CUTLASS中的Tensor 
Cores加速卷积](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcspring21-s31883\u002F)\n- [通过提高CUTLASS中Tensor Cores的利用率来加速反向数据梯度计算](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcspring22-s41996\u002F)\n- [CUTLASS：Python API、增强功能以及NVIDIA Hopper](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcfall22-a41131\u002F)\n\n# 构建CUTLASS\n\nCUTLASS是一个仅包含头文件的模板库，无需构建即可被其他项目使用。客户端应用程序应在其包含路径中指定CUTLASS的`include\u002F`目录。\n\nCUTLASS的单元测试、示例和实用工具可以通过CMake进行构建。CMake的最低版本在[快速入门指南](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fquickstart.html)中给出。请确保`CUDACXX`环境变量指向您系统上安装的CUDA Toolkit中的NVCC。\n\n```bash\n$ export CUDACXX=${CUDA_INSTALL_PATH}\u002Fbin\u002Fnvcc\n```\n\n在CUTLASS项目中创建一个构建目录，然后运行CMake。默认情况下，CUTLASS会为CUDA架构版本5.0、6.0、6.1、7.0、7.5、8.0、8.6、8.9和9.0编译内核。为了减少编译时间，您可以更改CMake配置设置`CUTLASS_NVCC_ARCHS`来指定要为哪些架构构建CUTLASS。\n\n```bash\n$ mkdir build && cd build\n\n$ cmake .. -DCUTLASS_NVCC_ARCHS=80               # 为NVIDIA的Ampere架构编译\n```\n\n从`build\u002F`目录中，通过使用make构建目标`test_unit`来编译并运行CUTLASS的单元测试。\n\n单元测试被组织成几个二进制文件，分别对应CUTLASS的顶级命名空间，可以通过make的`-j`命令行参数并行执行。\n\n```bash\n$ make test_unit -j\n...\n...\n...\n[----------] 全局测试环境清理\n[==========] 共有57个测试用例中的946个测试已运行。（总耗时10812毫秒）\n[  PASSED  ] 946个测试。\n```\n\n在支持的平台上，所有测试都应通过，尽管测试的具体数量可能会随时间变化。\n\n\n# 项目结构\n\nCUTLASS由一个仅包含头文件的库以及实用工具、工具、示例和单元测试组成。[Doxygen文档](https:\u002F\u002Fnvidia.github.io\u002Fcutlass)提供了CUTLASS项目中定义的文件、类和模板概念的完整列表。\n\n关于源代码组织的详细说明可以在[CUTLASS文档](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fcode_organization.html)中找到，但以下几个主要组件在此简要概述。\n\n## CUTLASS模板库\n\n```\ninclude\u002F                     # 客户端应用程序应在构建的包含路径中指定此目录\n\n  cutlass\u002F                   # 线性代数子程序和求解器的CUDA模板——仅包含头文件\n\n    arch\u002F                    # 直接暴露架构特性（包括指令级GEMM）\n\n    conv\u002F                    # 针对卷积优化的代码\n\n    epilogue\u002F                # 针对GEMM\u002F卷积后处理阶段优化的代码\n\n    gemm\u002F                    # 针对通用矩阵乘法计算优化的代码\n\n    layout\u002F                  # 矩阵、张量及其他数学对象在内存中的布局定义\n\n    platform\u002F                # 具备CUDA功能的标准库组件\n\n    reduction\u002F               # 带宽受限的归约内核，不符合“gemm”模型\n\n    thread\u002F                  # 可在CUDA线程内执行的SIMT代码\n\n    transform\u002F               # 针对布局、类型和域转换优化的代码\n\n    *                        # 核心词汇类型、容器和基本数值运算\n\n  cute\u002F                      # CuTe布局、布局代数、MMA\u002FCopy原子、分块MMA\u002FCopy\n\n    algorithm\u002F               # 对核心操作的定义，如复制、GEMM以及对cute::tuple的操作\n\n    arch\u002F                    # 复制和数学指令的裸PTX包装结构体\n\n    atom\u002F                    # 与arch\u002F运算符相关联或基于arch\u002F运算符构建的元信息\n\n      mma_atom.hpp           # cute::Mma_Atom和cute::TiledMma\n\n      copy_atom.hpp          # cute::Copy_Atom和cute::TiledCopy\n\n      *sm*.hpp               # 针对复制和数学操作的特定架构元信息\n\n    *                        # 核心库类型，如Shape、Stride、Layout、Tensor及其相关操作\n\n```\n\n### CUTLASS SDK示例\n\n[CUTLASS SDK示例](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples)应用CUTLASS模板来实现基本计算。\n\n### 工具\n\n```\ntools\u002F\n  library\u002F                   # CUTLASS实例库——包含所有受支持的CUTLASS模板的实例化版本\n    include\u002F\n      cutlass\u002F\n        library\u002F\n\n  profiler\u002F                  # CUTLASS性能分析工具         ——用于在CUTLASS库中执行操作的命令行实用程序\n\n  util\u002F                      # CUTLASS实用工具        ——包含大量帮助类，用于\n    include\u002F                 #                            管理设备内存中的张量、GEMM的参考实现、\n      cutlass\u002F          
     #                            张量的随机初始化以及输入输出。\n        util\u002F\n```\n\n### 测试\n\n`test\u002Funit\u002F`目录包含使用Google Test实现的单元测试，展示了核心API组件的基本用法以及对CUTLASS GEMM计算的全面测试。\n\n构建和运行单元测试的说明在[快速入门指南](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fquickstart.html)中有所介绍。\n\n# 性能剖析\n\n`tools\u002Fprofiler\u002F`目录包含一个用于启动每个GEMM内核的命令行实用程序。可以按以下方式构建：\n\n```bash\n$ make cutlass_profiler -j16\n```\n\n## 构建所有 GEMM 和卷积核（*耗时极长*）\n\n默认情况下，对于每种数据类型、数学指令和布局，仅实例化一个分块大小。\n若要实例化所有分块大小，请在从空的 `build\u002F` 目录运行 CMake 时设置以下 CMake 变量。\n请注意，这将生成 *数以万计* 的内核，并导致漫长的构建时间。\n此外，还会使二进制文件体积庞大，在某些平台上可能导致链接器在构建库时失败。\n因此，强烈建议仅生成一部分内核，如下一小节所示。\n```bash\n$ cmake .. -DCUTLASS_NVCC_ARCHS=90a -DCUTLASS_LIBRARY_KERNELS=all\n...\n$ make cutlass_profiler -j16\n```\n\n## 构建部分 GEMM 和卷积核（*缩短构建时间*）\n\n若要严格编译单个内核或少量内核，可以使用逗号分隔的内核名称列表，并配合通配符来缩小内核集合。以下示例展示了为 NVIDIA Ampere 和 Turing 架构精确编译单个或部分内核的方法：\n\n### 构建部分 Tensor Core GEMM 内核\n\n若要编译针对 NVIDIA Ampere 和 Turing 架构、采用 FP32 累加且 FP16 输入的部分 Tensor Core GEMM 内核，请使用以下 CMake 命令行：\n```bash\n$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8\n...\n$ make cutlass_profiler -j16\n```\n\n用于分析部分 Tensor Core GEMM 内核的命令行示例如下：\n```bash\n.\u002Ftools\u002Fprofiler\u002Fcutlass_profiler --kernels=cutlass_tensorop_s*gemm_f16_*_nt_align8 --m=3456 --n=4096 --k=4096\n\n...\n=============================\n  问题编号：1\n\n        提供者：CUTLASS\n   操作类型：gemm\n       操作：cutlass_tensorop_s1688gemm_f16_256x128_32x2_nt_align8\n\n          状态：成功\n    验证：开启\n     处理结果：通过\n\nreference_device：通过\n          cuBLAS：通过\n\n       参数：--gemm_kind=universal --m=3456 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f32:column --alpha=1  \\\n                  --beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=f32 --cta_m=256 --cta_n=128  \\\n                  --cta_k=32 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=75  \\\n                  --max_cc=1024\n\n           字节数：118,489,088 字节\n           浮点运算次数：115,992,428,544 次浮点运算\n\n         运行时间：1.55948 毫秒\n          内存带宽：70.7616 GiB\u002Fs\n\n            计算性能：74,378.8 GFLOP\u002Fs\n\n\n\n=============================\n...\n```\n\n### 构建单个 CUDA Core GEMM 内核\n\n若要编译针对 NVIDIA Ampere 和 Turing 架构的单个 SGEMM 内核，请使用以下 CMake 命令行：\n```bash\n$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sgemm_128x128_8x2_nn_align1\n...\n$ make cutlass_profiler -j16\n```\n\n用于分析单个 SGEMM CUDA 核的命令行示例如下：\n```bash\n$ .\u002Ftools\u002Fprofiler\u002Fcutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096\n\n=============================\n  问题编号：1\n\n        提供者：CUTLASS\n   操作类型：gemm\n       操作：cutlass_simt_sgemm_128x128_8x2_nn_align1\n\n          状态：成功\n    验证：开启\n     处理结果：通过\n\n          cuBLAS：通过\n\n       参数：--m=3456 --n=4096 --k=4096 --A=f32:column --B=f32:column --C=f32:column --alpha=1 --beta=0 --split_k_slices=1  \\\n                  --batch_count=1 --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4  \\\n                  --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024\n\n           字节数：180,355,072 字节\n           浮点运算次数：115,992,428,544 次浮点运算\n\n         运行时间：6.73655 毫秒\n          内存带宽：24.934 GiB\u002Fs\n\n            计算性能：17,218.4 GFLOP\u002Fs\n\n=============================\n```\n\n### 构建部分 Tensor Core 卷积核\n\n若要编译针对 NVIDIA Ampere 和 Turing 架构、采用 FP32 累加且 FP16 输入、实现前向传播（fprop）的部分 Tensor Core 卷积核，请使用以下 CMake 命令行：\n```bash\n$ cmake .. 
-DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*fprop_optimized_f16\n...\n$ make cutlass_profiler -j16\n```\n\n用于分析部分 Tensor Core 卷积核的命令行示例如下：\n```bash\n$ .\u002Ftools\u002Fprofiler\u002Fcutlass_profiler --kernels=cutlass_tensorop_s*fprop_optimized_f16 --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3\n\n...\n=============================\n  问题编号：1\n\n        提供者：CUTLASS\n   操作类型：conv2d\n       操作：cutlass_tensorop_s16816fprop_optimized_f16_128x128_32x5_nhwc\n\n          状态：成功\n    验证：开启\n     处理结果：通过\n\nreference_device：通过\n\n       参数：--conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1  \\\n                  --stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f16:nhwc --Filter=f16:nhwc --Output=f32:nhwc  \\\n                  --conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1  \\\n                  --eq_gemm_provider=none --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=128 --cta_k=32 --stages=5  \\\n                  --warps_m=2 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024\n\n           字节数：1,130,659,840 字节\n           浮点运算次数：118,482,796,544 次浮点运算\n\n         运行时间：0.711496 毫秒\n          内存带宽：1,479.99 GiB\u002Fs\n\n            计算性能：166,526 GFLOP\u002Fs\n\n=============================\n...\n```\n\n### 构建一个卷积 CUDA 内核\n\n要编译并运行一个使用 FP32 累加和 FP32 输入、针对 NVIDIA Ampere 和 Turing 架构实现前向传播（fprop）的 CUDA Core 卷积内核，请使用以下 CMake 命令行：\n```bash\n$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc\n...\n$ make cutlass_profiler -j16\n```\n\n用于分析一个 CUDA Core 卷积内核的示例命令行如下：\n```bash\n$ .\u002Ftools\u002Fprofiler\u002Fcutlass_profiler --kernels=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3\n\n\n=============================\n  问题编号：1\n\n        提供者：CUTLASS\n   操作类型：conv2d\n       操作：cutlass_simt_sfprop_optimized_128x128_8x2_nhwc\n\n          状态：成功\n    验证：开启\n     处理结果：通过\n\nreference_device：通过\n\n       参数：--conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1  \\\n                  --stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f32:nhwc --Filter=f32:nhwc --Output=f32:nhwc  \\\n                  --conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1  \\\n                  --eq_gemm_provider=none --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4  \\\n                  --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024\n\n           字节数：2,055,798,784 字节\n           浮点运算次数：118,482,796,544 次浮点运算\n\n         运行时间：7.34266 毫秒\n          内存带宽：260.752 GiB\u002Fs\n\n            计算性能：16,136.2 GFLOP\u002Fs\n\n\n=============================\n\n```\n\n## 编译 CUTLASS 内核和 CUTLASS Profiler 的更多细节\n- 请参阅以下链接，获取有关选择性编译 CUTLASS 内核的更多 CMake 示例：\n  - [GEMM CMake 示例](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fquickstart.html#gemm-cmake-examples)\n  - [隐式 GEMM 卷积 CMake 示例](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fquickstart.html#convolution-cmake-examples)\n- [关于 CUTLASS Profiler 的更多详细信息请见此处。](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fprofiler.html)\n\n\n# 关于\n\nCUTLASS 由 NVIDIA 
公司以开源软件的形式发布，采用\n[三条款“新”BSD 许可证](LICENSE.txt)。\n\n# 贡献者\n\nCUTLASS 的官方开发者和贡献者列表可在以下位置找到：[CONTRIBUTORS](CONTRIBUTORS.md)。\n\n# 版权\n版权所有 © 2017 - 2026 NVIDIA CORPORATION & AFFILIATES。保留所有权利。\nSPDX 许可证标识：BSD-3-Clause\n\n```\n  重新分发和使用源代码及二进制文件，无论是否修改，均在满足以下条件时被允许：\n\n  1. 源代码的再分发必须保留上述版权声明、本条件列表以及以下免责声明。\n\n  2. 二进制形式的再分发必须在随附的文档或其他材料中复制上述版权声明、本条件列表以及以下免责声明。\n\n  3. 未经事先书面许可，不得使用版权持有者的名称或其贡献者的名称来认可或推广由此软件衍生的产品。\n\n  本软件由版权持有者和贡献者按“原样”提供，不提供任何明示或暗示的保证，包括但不限于适销性和特定用途适用性的暗示保证。在任何情况下，版权持有者或贡献者均不对直接、间接、偶然、特殊、惩戒性或后果性损害（包括但不限于替代商品或服务的采购、使用损失、数据丢失、利润损失或业务中断）承担责任，即使已被告知发生此类损害的可能性。\n```","# CUTLASS 快速上手指南\n\nCUTLASS (CUDA Templates for Linear Algebra Subroutines) 是 NVIDIA 开源的高性能矩阵乘法（GEMM）及相关计算的 C++ 模板库。它支持从 Volta 到 Blackwell 的全系列 NVIDIA GPU 架构，并提供 C++ 模板抽象和全新的 Python CuTe DSL 接口。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下最低要求：\n\n### 系统要求\n*   **操作系统**: 推荐 Ubuntu 18.04\u002F20.04\u002F22.04。\n    *   *注意*: 目前 Windows 平台构建已知存在问题，建议使用 Linux。\n*   **编译器**: 支持 C++17 标准的宿主编译器。\n    *   推荐：GCC 9.0 或更高版本（GCC 7.5.0 可用，但避免使用 GCC 8.5.0）。\n*   **CUDA Toolkit**:\n    *   最低版本：11.4\n    *   推荐版本：**CUDA 12.8** (以获得最佳性能和 Blackwell 架构支持)。\n\n### 硬件支持\nCUTLASS 支持以下架构的 NVIDIA GPU：\n*   Volta (V100), Turing (T4, RTX 20xx), Ampere (A100, RTX 30xx), Ada (RTX 40xx), Hopper (H100), Blackwell (B200, RTX 50xx)。\n*   **重要提示**: 针对 Hopper (SM90) 和 Blackwell (SM100\u002FSM120) 架构的新特性（如异步 Warp 组矩阵指令），编译时必须指定带 `a` 后缀的目标架构（例如 `90a`, `100a`），否则运行时会报错。\n\n### 前置依赖\n*   **CMake**: 3.18 或更高版本。\n*   **Git**: 用于克隆代码库。\n*   **(可选) Python**: 如果您计划使用新的 **CuTe DSL** (Python 接口)，需安装 Python 3.8+ 及相关依赖。\n\n---\n\n## 安装步骤\n\n### 1. 克隆仓库\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass.git\ncd cutlass\n```\n\n### 2. 创建构建目录\n```bash\nmkdir build && cd build\n```\n\n### 3. 配置构建 (CMake)\n\n根据您的目标 GPU 架构选择配置命令。\n\n#### 场景 A：通用构建 (兼容大多数旧架构)\n```bash\ncmake .. -DCUTLASS_NVCC_ARCHS=\"70;75;80;86\"\n```\n\n#### 场景 B：针对 Hopper (H100\u002FH200) 优化\n必须使用 `90a` 以启用架构加速特性。\n```bash\ncmake .. -DCUTLASS_NVCC_ARCHS=\"90a\"\n```\n\n#### 场景 C：针对 Blackwell (B200\u002FRTX 50xx) 优化\n数据中心卡 (SM100) 和消费级卡 (SM120) 需区分：\n```bash\n# 针对 B200 等数据中心卡\ncmake .. -DCUTLASS_NVCC_ARCHS=\"100a\"\n\n# 针对 RTX 5090\u002F5080 等消费级卡（SM120 的新特性同样需要 a 后缀）\ncmake .. -DCUTLASS_NVCC_ARCHS=\"120a\"\n```\n\n> **提示**: 您可以组合多个架构，例如 `-DCUTLASS_NVCC_ARCHS=\"80;90a;100a\"`。
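\n\n不确定手头 GPU 对应的架构号时，可以先查询其计算能力（Compute Capability）再填写 `CUTLASS_NVCC_ARCHS`。以下查询方式需要较新的 NVIDIA 驱动，仅作示意：\n```bash\n# 输出形如 \"NVIDIA H100, 9.0\"，对应架构号 90\u002F90a\nnvidia-smi --query-gpu=name,compute_cap --format=csv,noheader\n```\n\n### 4. 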
编译\n利用多核加速编译过程（将 `\u003Cjobs>` 替换为您的 CPU 核心数，如 `8` 或 `16`）：\n```bash\nmake -j\u003Cjobs>\n```\n\n*(注：如果您仅需使用头文件或 Python DSL，通常无需完整编译所有示例，但构建测试用例可验证环境正确性。)*\n\n---\n\n## 基本使用\n\nCUTLASS 提供两种主要使用方式：传统的 **C++ 模板 API** 和新推出的 **CuTe DSL (Python)**。\n\n### 方式一：C++ 模板 API (高性能生产环境)\n\nCUTLASS 主要通过 C++ 模板实例化来生成内核。以下是一个最简单的调用现有 GEMM 操作的逻辑示例。\n\n**代码示例 (`simple_gemm.cu`):**\n\n```cpp\n#include \"cutlass\u002Fgemm\u002Fdevice\u002Fgemm.h\"\n\nint main() {\n    \u002F\u002F 定义 GEMM 操作类型\n    using Gemm = cutlass::gemm::device::Gemm\u003C\n        float,                              \u002F\u002F ElementA\n        cutlass::layout::RowMajor,          \u002F\u002F LayoutA\n        float,                              \u002F\u002F ElementB\n        cutlass::layout::RowMajor,          \u002F\u002F LayoutB\n        float,                              \u002F\u002F ElementC\n        cutlass::layout::RowMajor,          \u002F\u002F LayoutC\n        float                               \u002F\u002F ElementAccumulator\n    >;\n\n    \u002F\u002F 定义问题规模 (M, N, K)\n    Gemm::Arguments args{\n        {512, 512, 512},                    \u002F\u002F Problem Size\n        {nullptr, 512},                     \u002F\u002F Ptr A & Leading Dim\n        {nullptr, 512},                     \u002F\u002F Ptr B & Leading Dim\n        {nullptr, 512},                     \u002F\u002F Ptr C & Leading Dim\n        {nullptr, 512},                     \u002F\u002F Ptr D & Leading Dim\n        {1.0f, 0.0f}                        \u002F\u002F Epilogue: alpha, beta\n    };\n\n    \u002F\u002F 实例化并检查支持性\n    Gemm gemm_op;\n    auto status = gemm_op.can_implement(args);\n    if (status != cutlass::Status::kSuccess) {\n        return -1;\n    }\n\n    \u002F\u002F 启动计算（此处指针均为 nullptr，仅演示调用流程；真实调用见下方补充示例）\n    status = gemm_op(args);\n    \n    return (status == cutlass::Status::kSuccess) ? 0 : -1;\n}\n```\n\n**编译命令**（`-arch` 请按目标 GPU 调整，如 `sm_90a`）:\n```bash\nnvcc -std=c++17 -arch=sm_80 -I..\u002Finclude simple_gemm.cu -o simple_gemm\n```
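\n\n在真实场景中，需要先分配并初始化显存，再把有效的设备指针填入 `Arguments`。下面是一个沿用上文 `Gemm` 类型定义的最小示意（省略了数据填充与完整的错误检查，函数名为本文虚构）：\n\n```cpp\n#include \u003Ccuda_runtime.h>\n\n\u002F\u002F 承接上文 simple_gemm.cu 中的 Gemm 类型定义\nint run_with_device_memory() {\n    int M = 512, N = 512, K = 512;\n    float *A, *B, *C, *D;\n    cudaMalloc(&A, sizeof(float) * M * K);  \u002F\u002F A: M x K，行主序，ld = K\n    cudaMalloc(&B, sizeof(float) * K * N);  \u002F\u002F B: K x N，行主序，ld = N\n    cudaMalloc(&C, sizeof(float) * M * N);  \u002F\u002F C: M x N，行主序，ld = N\n    cudaMalloc(&D, sizeof(float) * M * N);  \u002F\u002F D: 输出矩阵\n    \u002F\u002F 实际使用时在此处填充 A、B、C 的数据（如 cudaMemcpy 或初始化内核）\n\n    Gemm::Arguments args{\n        {M, N, K},\n        {A, K}, {B, N}, {C, N}, {D, N},\n        {1.0f, 0.0f}\n    };\n\n    Gemm gemm_op;\n    cutlass::Status status = gemm_op(args);  \u002F\u002F 在默认流上入队执行\n    cudaDeviceSynchronize();                 \u002F\u002F 等待内核完成\n\n    cudaFree(A); cudaFree(B); cudaFree(C); cudaFree(D);\n    return (status == cutlass::Status::kSuccess) ? 0 : -1;\n}\n```\n\n更完整的端到端示例可参考仓库中的 `examples\u002F00_basic_gemm`。\n\n### 方式二：CuTe DSL (Python 快速原型)\n\nCUTLASS 4.x 引入了 Python 原生接口，无需编写 C++ 胶水代码即可定义高性能内核。\n\n**前置准备:**\n确保已安装相关 Python 依赖（通常在 `python\u002F` 目录下有 `setup.py` 或 `requirements.txt`；新版本也发布了 `nvidia-cutlass-dsl` wheel 包）。\n\n**代码示例 (`simple_gemm.py`):**\n\n```python\nimport cutlass\nimport cutlass.cute as cute  # CuTe DSL 的常见导入方式，请以官方示例为准\n\n# 定义布局 (Layout) 和张量 (Tensor) 概念\n# 这里演示如何声明一个简单的矩阵乘法逻辑结构\ndef run_gemm():\n    # 定义矩阵维度\n    M, N, K = 128, 128, 64\n    \n    # 使用 CuTe DSL 定义操作 (伪代码示意，具体 API 随版本迭代可能微调)\n    # 详细用法请参考官方 CuTe DSL Quick Start\n    print(f\"Initializing GEMM for {M}x{N}x{K} using CuTe DSL...\")\n    \n    # 在实际环境中，这里会调用 CuTe DSL 提供的算子构建接口\n    # 可参考官方示例中的 dense GEMM\u002Felementwise 内核定义\n    \n    print(\"Kernel definition complete. 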
Ready for compilation\u002Fexecution.\")\n\nif __name__ == \"__main__\":\n    run_gemm()\n```\n\n> **进阶学习**:\n> *   C++ 详细文档：[CUTLASS C++ Quick Start](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002Fcpp\u002Fquickstart.html)\n> *   Python DSL 教程：[CuTe DSL Quick Start](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002FpythonDSL\u002Fquick_start.html)\n> *   示例代码库：查看 `examples\u002F` 目录，其中包含针对不同架构（如 `95_blackwell_gemm`）的具体实现。","某自动驾驶研发团队正在为新一代 Blackwell 架构 GPU 定制低精度（MXFP4）矩阵乘法内核，以加速实时感知模型的推理速度。\n\n### 没有 cutlass 时\n- 工程师必须深入编写复杂的 CUDA C++ 模板元编程代码，手动管理线程层级和数据分块，开发门槛极高且容易出错。\n- 每次调整算法策略或数据类型都需要重新编译庞大的 C++ 工程，单次迭代耗时数分钟，严重拖慢原型验证节奏。\n- 难以直接利用最新的块缩放数据类型（如 MXFP4），缺乏现成抽象导致需从零实现底层数据搬运逻辑。\n- 与 PyTorch 等深度学习框架集成时需编写大量“胶水代码”，增加了维护负担和运行时开销。\n\n### 使用 cutlass 后\n- 团队利用 CuTe DSL 通过 Python 原生接口即可定义高性能内核，无需精通 C++ 模板技巧，大幅降低了开发难度。\n- 修改算子配置后可在秒级内完成编译并立即测试，迭代效率获得数量级的提升，快速锁定最优参数。\n- 直接调用 cutlass 内置的 MXFP4 及 Tensor Core 专用抽象，自动处理异步数据拷贝，确保在 Blackwell 架构上跑满硬件算力。\n- 生成的内核可无缝嵌入现有训练流程，消除了额外的集成代码，让研究人员能专注于算法创新而非底层优化。\n\ncutlass 通过 Python DSL 与模块化 C++ 抽象的结合，将原本需要数周的高性能算子开发周期缩短至几天，真正实现了从原型到生产的平滑过渡。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVIDIA_cutlass_7c137dc7.png","NVIDIA","NVIDIA Corporation","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FNVIDIA_7dcf6000.png","",null,"https:\u002F\u002Fnvidia.com","https:\u002F\u002Fgithub.com\u002FNVIDIA",[80,84,88,92,96,100,104,108,111],{"name":81,"color":82,"percentage":83},"C++","#f34b7d",60.5,{"name":85,"color":86,"percentage":87},"Cuda","#3A4E3A",28.5,{"name":89,"color":90,"percentage":91},"Python","#3572A5",8.6,{"name":93,"color":94,"percentage":95},"HTML","#e34c26",1.7,{"name":97,"color":98,"percentage":99},"CMake","#DA3434",0.6,{"name":101,"color":102,"percentage":103},"C","#555555",0.1,{"name":105,"color":106,"percentage":107},"Shell","#89e051",0,{"name":109,"color":110,"percentage":107},"Batchfile","#C1F12E",{"name":112,"color":113,"percentage":107},"Makefile","#427819",9558,1782,"2026-04-11T07:11:48","NOASSERTION",4,"Linux (Ubuntu 18.04, 20.04, 22.04)","必需 NVIDIA GPU (Volta 架构及以上，如 V100, A100, H100, B200, RTX 30\u002F40\u002F50 系列等)。显存大小未说明。需安装 CUDA Toolkit 11.4+ (推荐 12.8)，针对 Hopper\u002FBlackwell 架构的新特性需特定目标架构 (如 sm_90a, sm_100a)。","未说明",{"notes":123,"python":124,"dependencies":125},"1. Windows 平台目前不支持构建 (团队正在修复)。2. 主机编译器必须支持 C++17，避免使用 GCC 8.5.0 (存在已知回归问题)，推荐使用 GCC 7.5.0 或 GCC >=9。3. 对于 Hopper (GH100) 和 Blackwell 架构，编译时必须指定带 'a' 后缀的目标架构 (如 -DCUTLASS_NVCC_ARCHS=\"90a\") 以启用加速特性，否则运行时会报错。4. Blackwell 数据中心卡 (SM100) 与 GeForce RTX 50 系列 (SM120) 的计算能力不同，编译的内核不通用。","未说明 (注：CuTe DSL 为 Python 原生接口，但文中未指定具体版本)",[126,127,97],"CUDA Toolkit (>=11.4, 推荐 12.8)","C++ 编译器 (支持 C++17, 推荐 GCC >=9)",[14],[130,131,132,133,134,135,136],"cuda","deep-learning","deep-learning-library","cpp","nvidia","gpu","python","2026-03-27T02:49:30.150509","2026-04-11T20:31:58.357803",[140,145,150,155,160,165],{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},29781,"如何在 CUTLASS GEMM 操作中指定自定义 CUDA 流（stream）而不是默认流？","可以直接将 stream 对象传递给 gemm 操作符。代码示例如下：\ncutlass::Status status = gemm_operator(stream);\n其中 stream 是你创建的 cudaStream_t 类型的流对象。
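\n\n一个稍完整的调用示意（假设沿用快速上手示例中的 gemm_op 与 args；接口签名请以所用版本的头文件为准）：\ncudaStream_t stream;\ncudaStreamCreate(&stream);\ngemm_op.initialize(args);                   \u002F\u002F 先绑定参数与工作区\ncutlass::Status status = gemm_op(stream);   \u002F\u002F 在自定义流上入队执行\ncudaStreamSynchronize(stream);\ncudaStreamDestroy(stream);","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Fissues\u002F922",{"id":146,"question_zh":147,"answer_zh":148,"source_url":149},29782,"CUTLASS 是否支持在 Conv2dFprop 中处理带有 stride（步长）的输出张量？如果忽略 stride 导致填充错误怎么办？","早期版本可能存在忽略输出 stride 的问题。解决方案包括：\n1. 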
参考官方示例代码切换使用 packed 或 non-packed 的 conv epilogue，其中 non-packed 模式允许填充 0。\n示例参考：https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Fblob\u002Fmain\u002Fexamples\u002F16_ampere_tensorop_conv2dfprop\u002Fampere_tensorop_conv2dfprop.cu#L269\n2. 维护者已确认开发了新的 epilogue 以尊重输出 stride，该功能将在 3.5.1 版本或之后发布。建议升级至最新版本或使用开发分支代码。","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Fissues\u002F1323",{"id":151,"question_zh":152,"answer_zh":153,"source_url":154},29783,"如何使用 CUTLASS 实现包含除法和加法操作的 BatchNorm（批归一化）层作为 Epilogue？","可以通过自定义 Epilogue 来实现复杂的逐元素操作。可以参考 CUTLASS 测试代码中 ResNet-50 的实现案例，其中展示了如何处理类似的融合操作。\n参考代码位置：https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Fblob\u002Fmaster\u002Ftest\u002Funit\u002Fconv\u002Fdevice\u002Fconv2d_problems.h#L520","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Fissues\u002F509",{"id":156,"question_zh":157,"answer_zh":158,"source_url":159},29784,"CUTLASS 是否有计划支持混合数据类型的反量化（de-quantization）以及 int4 数据类型？","社区对此有明确的需求和进展。维护者已创建了相关的 Pull Request 来添加这些功能：\n1. 支持 4-bit (int4) 混合数据类型的 PR：https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Fpull\u002F1190\n2. 支持稀疏 GEMM 的 EVT epilogue 的 PR：https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Fpull\u002F1189\n建议关注这些 PR 的合并状态以获取最新支持。","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Fissues\u002F1122",{"id":161,"question_zh":162,"answer_zh":163,"source_url":164},29785,"在使用 CUTLASS 进行两个 GEMM 操作融合（Two Tensor Op Fusion）时，某些特定配置下结果验证失败，如何调试或解决？","当改变 GEMM 坐标参数（如 M, N, K）导致结果错误时，通常是因为隐含的参数（如 threadblock shape, warp shape 等）未针对新尺寸进行优化或不匹配。\n调试建议：\n1. 检查示例代码中的默认配置（如 examples\u002F13_two_tensor_op_fusion\u002F 目录下）。\n2. 尝试调整 GemmShape 相关模板参数以匹配新的问题规模。\n3. 确保累加器类型和数据布局与硬件架构（如 sm75）兼容。\n如果问题持续，可能需要针对特定尺寸手动调优 kernel 配置参数。","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Fissues\u002F461",{"id":166,"question_zh":167,"answer_zh":168,"source_url":144},29786,"如何限制 GEMM 内核的资源利用率（如寄存器、共享内存或 SM），以便为 GPU 上的其他任务留出资源？","可以通过调整 cutlass::gemm::GemmShape 模板参数来控制线程块的大小和资源占用。减小线程块尺寸（Threadblock Tile Size）通常会减少每个块消耗的寄存器和共享内存，从而释放部分 GPU 资源供其他流或任务使用。需要根据具体的硬件架构（如 A100）和性能需求进行权衡测试。",[170,175,180,185,190,195,200,205,210,215,220,225,230,235,240,245,250,255,260,265],{"id":171,"version":172,"summary_zh":173,"released_at":174},206350,"v4.4.2","### CuTe DSL\n* 新特性\n  - CuTe DSL 现在同时支持 x86_64 和 aarch64 架构上的 Python 3.14。\n  - 运行时指针\u002F张量\u002FFakeTensor 现在支持 __cache_key__，提供了一个稳定且可哈希的表示，从而简化并提升了编译后函数的缓存机制。\n* Bug 修复与改进\n  - 通过优化 mbarrier 同步以避免不必要的收敛屏障，修复了 CUDA 工具包 13.1 下 Hopper FMHA 因果注意力性能退化的问题。\n  - 修复了 JAX 中同一进程中存在多块 GPU 时出现的内核加载竞争条件问题。\n\n### CUTLASS C++\n* 启用 Blackwell SM120f 的示例编译，并在 CUTLASS 性能分析器中公开 NVFP4\u002FMX 分组 GEMM。","2026-03-17T14:55:49",{"id":176,"version":177,"summary_zh":178,"released_at":179},206351,"v4.4.1","### CuTe DSL\n* 修复 bug 和改进\n  - 修复了 aarch64 架构上 tvm-ffi 的段错误问题","2026-02-28T03:30:17",{"id":181,"version":182,"summary_zh":183,"released_at":184},206352,"v4.4.0","### CuTe DSL\n* 新特性\n  - CuTe DSL 现已支持 CUDA 工具包 13.1！\n    + 可通过 cutlass\u002Fpython\u002FCuTeDSL\u002Fsetup.sh --cu13 进行设置\n    + 更多详情请参阅：https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Flatest\u002Fmedia\u002Fdocs\u002FpythonDSL\u002Fquick_start.html\n  - 在 CTK 13.1 中，CuTe DSL 已支持 GB300 架构\n    + 示例内核请参考 [SM103 批处理 3x FP4 块缩放 GEMM 内核](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002Fpython\u002FCuTeDSL\u002Fblackwell\u002Fsm103_dense_blockscaled_gemm_persistent.py)\n  - cute.experimental：在现有 CuTe DSL API 的基础上引入更高层次、可组合的抽象层（并非独立的抽象），该层可与现有的 Cute DSL 构建块混合使用。\n    + 
无片段编程模型：copy\u002Fdot API 直接接收 memref，而非描述符或片段。\n    + 自动 TMA 描述符生成及更新插入。\n    + 自动向量化与 SIMT 复制的预测执行。\n    + 新的流水线抽象及便捷封装。\n    + 新的 Partition 操作，简化分区逻辑。\n    + 设备端 TMA 描述符的分配、初始化与管理。\n    + 示例代码可在以下链接找到：https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002Fpython\u002FCuTeDSL\u002Fexperimental\n  - 现已支持提前编译（AoT）！\n    + 示例用法请参阅：https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002Fpython\u002FCuTeDSL\u002Fcute\u002Fexport 下的相关文件\n  - JAX 支持——现在可以将 CuTeDSL 与 JAX 配合使用。\n    + 示例用法请参阅：https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002Fpython\u002FCuTeDSL\u002Fjax 下的相关文件\n  - 在 DSL 中引入了版本管理支持：\n    + cutlass.__version__ 提供 DSL 版本的字符串表示。\n    + cutlass.CUDA_VERSION 提供一个版本类，用于指示 DSL 所使用的 CUDA 版本。\n  - 新增 CopyDsmemStoreOp，用于将数据存储到分布式共享内存，并进行显式同步。\n  - 分组 GEMM 示例现支持仅在设备上定义的问题规模。\n  - 允许在主机端无需提供问题规模的情况下进行网格划分。\n  - Tma+LdMatrix 功能，用于加载并解包窄宽度类型数据（示例用法请参阅 mixed_input_fmha_decode.py）。\n  - 现可通过 Python Epilogue Fusion Configuration (EFC) 函数为持久化密集 GEMM 定制尾部融合操作，其功能与 CUTLASS C++ EVT 类似。此外，还提供 PyTorch 评估器用于比较结果。\n\n* 更多实现峰值性能内核的示例\n  - [SM103 批处理 3x FP4 块缩放 GEMM 内核](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002Fpython\u002FCuTeDSL\u002Fblackwell\u002Fsm103_dense_blockscaled_gemm_persistent.py)\n  - 混合输入 FMHA 解码示例，支持 int4 KV（int8 KV 在 4.3 版本中已支持）。\n  - 引入了新的 acc_scale 分组混合输入 GEMM 内核变体，以提升解码场景下的性能。\n  - 所有 mixed_input_gemm 示例已被移至单独的 `mixed_input_gemm` 文件夹。通用工具函数也被提取至 mixe","2026-02-26T04:01:52",{"id":186,"version":187,"summary_zh":188,"released_at":189},206353,"v4.3.5","### CuTe DSL\n* 修复 bug 和改进\n  - 修复了 4.3.4 版本引入的意外 CPU 开销问题\n* 将版权年份更新为 2026 年。\n\n### CUTLASS C++\n* 将版权年份更新为 2026 年。\n* 使用 CUDA 驱动程序的 Get Version Runtime API，而非驱动程序 API。","2026-01-09T06:08:53",{"id":191,"version":192,"summary_zh":193,"released_at":194},206354,"v4.3.4","### CuTe DSL\n* 新特性\n  - 增加了 PDL 支持，并提供了示例 [使用程序化依赖启动的内核](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002Fpython\u002FCuTeDSL\u002Fblackwell\u002Fprogrammatic_dependent_launch.py)\n\n* 问题修复与改进\n  - 修复了 CUDA 图中的帧引用计数问题\n  - 针对 TVM-FFI 的 AOT 情况，优化了模块的提前卸载\n  - 修复了 utils\u002Fhopper_helpers.py 中 `make_smem_layout_a` 的顺序问题\n\n### CUTLASS C++\n* 临时解决方案：针对驱动程序中与 TMA 描述符相关的一个 bug。该 bug 会导致在 Blackwell 架构上，当张量的底层数组分配小于 128KB 且不是密集型非重叠张量时，偶尔出现错误。\n","2025-12-24T05:49:23",{"id":196,"version":197,"summary_zh":198,"released_at":199},206355,"v4.3.3","### CuTe DSL\n* 新特性\n  - 在 tvm-ffi 中支持使用 namedtuple 和 kwargs 作为 JIT 函数的参数\n  - 在 tvm-ffi 中支持可变参数元组作为 JIT 函数的参数\n\n* 错误修复与改进\n  - 修复了 tvm-ffi 中带有联合类型注解的 JIT 函数参数相关的问题\n  - 对于 runtime error cudaErrorInsufficientDriver 的情况，提供了更清晰的错误信息","2025-12-12T05:12:14",{"id":201,"version":202,"summary_zh":203,"released_at":204},206356,"v4.3.2","### CuTe DSL\n* 新特性\n  - 新增环境变量 `CUTE_DSL_CACHE_DIR`，用于指定缓存的存储路径\n\n* Bug 修复与改进\n  - 修复了 CUDA JitExecutor 在卸载内核时出现的问题\n  - 修复了在存在静态分配的共享内存时，申请最大共享内存容量的问题\n","2025-12-05T18:51:03",{"id":206,"version":207,"summary_zh":208,"released_at":209},206357,"v4.3.1","### CuTe DSL\n* 新特性\n    - 增加了对 Blackwell SM103 的支持\n    - 将 wheel 包中的多个依赖 DSO 合并为一个单独的 DSO\n* 修复与改进\n    - 修复了 tvm-ffi 中的设备重置问题\n    - 修复了 tvm-ffi 导出编译后函数的问题\n\n### CUTLASS C++\n* 在 [示例 92](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002F92_blackwell_moe_gemm\u002F) 中，通过新的简化 MoE API，支持块尺度变体的不规则连续分组 GEMM。\n    - 
新示例适用于所有微尺度类型。","2025-12-02T03:22:01",{"id":211,"version":212,"summary_zh":213,"released_at":214},206358,"v4.3.0","### CuTe DSL\n* 新特性：\n  - 支持 Apache [TVM-FFI](https:\u002F\u002Ftvm.apache.org\u002Fffi\u002Findex.html)，进一步降低 JIT 函数的主机运行时开销，提升与 PyTorch 和其他机器学习框架的互操作性。\n  - 添加了假张量和流，以解耦编译 JIT 函数与 `from_dlpack` 流程。现在用户在编译 JIT 函数时不再需要真实的张量。\n  - 新增了 `FastDivmodDivisor`，支持 Python 运算符重载、新 API、Cute Dialect 集成，并优化了静态分块调度器的性能，从而加快索引映射速度。\n  - 为 TMA 相关操作添加了 L2 缓存逐出优先级，用户可以进行细粒度的 L2 缓存控制。\n* 调试能力改进：\n    - 支持 DSL API 的源代码位置跟踪（允许像 ``nsight`` 这样的工具将性能指标与 Python 源代码关联起来）。\n    - 支持转储 PTX 和 CUBIN 代码：[Hello World 示例](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Fblob\u002Fmain\u002Fexamples\u002Fpython\u002FCuTeDSL\u002Fnotebooks\u002Fhello_world.ipynb)。\n* 更多示例和笔记本，帮助您快速入门 CuTe DSL：\n    - 改进了 [逐元素运算示例](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002Fpython\u002FCuTeDSL\u002Fampere\u002Felementwise_apply.py) 的性能：\n        + 代码通用化，可处理输入张量列表。\n        + TV 布局计算通用化，支持不同数据类型。\n    - 改进了 [Blackwell SM100 持久化稠密 GEMM 静态调度实现](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002Fpython\u002FCuTeDSL\u002Fblackwell\u002Fdense_gemm_persistent.py)：\n        + 展示了新的流水线 API `PipelineProducer` 和 `PipelineConsumer` 的用法，简化代码，无需显式管理流水线状态（旧版 API 仍保留）。\n        + 将非 TMA 和 TMA 实现的尾部处理代码分离。\n    - [Blackwell GEMM 教程：基础 Blackwell SM100 GEMM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002Fpython\u002FCuTeDSL\u002Fblackwell\u002Ftutorial_gemm)：\n        + [基准 Blackwell GEMM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002Fpython\u002FCuTeDSL\u002Fblackwell\u002Ftutorial_gemm\u002Ffp16_gemm_0.py) 在 MNK 8K 场景下实现了 84% 的 SOL 性能。\n        + 后续还将提供更多示例，展示优化效果：“基准 + X”。\n    - [异步流水线 API 教程](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002Fpython\u002FCuTeDSL\u002Fnotebooks\u002Fasync_pipeline.ipynb)。\n    - 重新编写了 [逐元素加法笔记本](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002Fpython\u002FCuTeDSL\u002Fnotebooks\u002Felementwise_add.ipynb)，增加了更多细节，并对 TV 布局进行了深入解释：\n        + 更新了实现，支持通用数据类型和多个输入。\n        + 用更通俗的语言重新解释了 TV 布局。\n        + 使用第三方工具可视化了 TV 布局。\n    - [基准测试与自动调优演示](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002Fpython\u002FCuTeDSL\u002Fnotebooks\u002Fbenchmark_autotune.ipynb)。\n* 更多发挥峰值性能内核的例子：\n    - [Blackwell SM100 混合输入","2025-11-24T22:24:16",{"id":216,"version":217,"summary_zh":218,"released_at":219},206359,"v4.2.1","### CuTe DSL\n* 错误修复和改进\n    - 修复了在使用 cuda-python 13.0 运行 DSL 代码时出现的问题\n    - 修复了在使用 Inductor 运行 DSL 代码时出现的问题\n    - 修复了在 FlashInfer 中运行 DSL 代码时出现的意外日志记录问题\n    - 修复了 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Fissues\u002F2647 中报告的问题\n    - 修复了在动态控制流之外对变量进行条件定义时出现的问题\n\n### CUTLASS C++\n* 在 Blackwell 架构上，对于 nosmem 分块内核绕过 EVT。\n* 将 cutlass\u002Fpython\u002Fcutlass 目录重命名为 cutlass\u002Fpython\u002Fcutlass_cppgen。\n","2025-09-24T05:23:13",{"id":221,"version":222,"summary_zh":223,"released_at":224},206360,"v4.2.0","### CuTe DSL\r\n* More Python versions are now supported for both x86-64 and aarch64, including\r\n    - Python 3.10, 3.11, 3.12, and 3.13\r\n* Added new example and updated notebook to get started with CuTe DSL\r\n    - [Call kernels with dlpack 
bypassed](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002Fpython\u002FCuTeDSL\u002Fampere\u002Fcall_bypass_dlpack.py)\r\n    - Updates on [TensorSSA demonstration](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002Fpython\u002FCuTeDSL\u002Fnotebooks\u002Ftensorssa.ipynb)\r\n      + Added a section for introducing the broadcast\r\n* API updates\r\n    - Please refer to [DSL API changelog](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Fmedia\u002Fdocs\u002FpythonDSL\u002Fcute_dsl_api\u002Fchangelog.html) for details\r\n* Bug fixings and improvements\r\n    - Fixed ``cute.print_tensor`` for coordinate tensor\r\n    - Fixed `cute.print` for tuple of layouts\r\n    - Fixed frozen object is not properly updated after fully assigned in dynamic control flow\r\n    - Fixed assign tuple\u002Flist element in a dynamic control flow may cause compilation failure\r\n    - Improved error message when CUDA context is not initialized\r\n    - Improved docstring of congruent and weakly_congruent\r\n\r\n### CUTLASS C++\r\n* Support for Blackwell SM103 kernels for B300 GPUs.\r\n    - Collective mainloop codes: [Blockscaled datatypes with support for dense GEMM mainloop](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Finclude\u002Fcutlass\u002Fgemm\u002Fcollective\u002Fsm103_blockscaled_mma_warpspecialized.hpp)\r\n    - New [GEMM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Finclude\u002Fcutlass\u002Fgemm\u002Fdispatch_policy.hpp) and [epilogue](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Finclude\u002Fcutlass\u002Fepilogue\u002Fdispatch_policy.hpp) dispatch policies for collectives, kernel layers, and builders.\r\n    - Kernel codes: [Blockscaled datatypes with support for dense GEMM kernel](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Finclude\u002Fcutlass\u002Fgemm\u002Fkernel\u002Fsm103_blockscaled_gemm_tma_warpspecialized.hpp).\r\n* Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM103 architecture:\r\n    - [Blockscaled ultra fp4 dense GEMM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002F89_sm103_fp4_ultra_gemm\u002F).\r\n    - [Blockscaled ultra fp4 dense grouped GEMM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002F90_sm103_fp4_ultra_grouped_gemm).\r\n* Set of unit tests that demonstrate the usage of Blackwell SM103 blockscaled GEMM\r\n    - Unit test files with prefix name of `sm103_` under [GEMM device unit tests](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Ftest\u002Funit\u002Fgemm\u002Fdevice\u002F).\r\n* Support for Blackwell SM121 kernels for DGX Spark GPUs.\r\n    - Share the major codes with Blackwell SM120 kernels.\r\n* Add support for heuristics-based kernel filtering and autotuning using `nvidia-matmul-heuristics` to find the best kernels for a given scenario.\r\n    - Details please refer to [heuristics doc](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fmedia\u002Fdocs\u002Fcpp\u002Fheuristics.md).\r\n* Further enhance Blackwell SM100 Attention kernels in [example 77](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002F77_blackwell_fmha\u002F).\r\n    - Add fused reduction kernel support for cutlass MLA.\r\n    - Add softmax skip 
correction.\r\n    - Support for GQA in FMHA backward kernel.\r\n    - Fix an issue where `get_unmasked_trip_count` may return a negative value.\r\n    - Fix an issue where mbarriers are initialized with a zero arrival count.\r\n    - Fix a corner case issue where the sequence length of q is not a multiple of tile_q.\r\n    - Remove tma padding for forward kernel inputs.\r\n* Add Blackwell SM100 kernels for MoEs (focusing on Low-Latency inference performance): [example 92](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002F92_blackwell_moe_gemm\u002F).  It uses TMA (for weights) and CPASYNC (for tokens) to load input matrices and allow only one problem dimension to vary across groups\u002Fexperts, unlike general Grouped GEMMs.  Note: further API simplifications and kernel improvements are upcoming. Any feedback on API is welcome.\r\n* Further enhance blockwise and groupwise GEMMs on Hopper and Blackwell\r\n    - On Blackwell SM120, a blockwise gemm kernel is added: [example 87](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002F87_blackwell_geforce_gemm_blockwise\u002F).\r\n    - On Hopper, add K major scale factor support for SM90 blockwise kernels.\r\n    - On Hopper, relax the restriction that the k dimension of the problem size has to be the multiple of the k dimension of the tile size.\r\n    - On Hopper, grouped version supports the case when k = 0.\r\n* Support for Blackwell SM100 fp4 gemv kernels.\r\n    - Kernel codes: [Gemv kernel](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Finclude\u002Fcutlass\u002Fgemm\u002Fkernel\u002Fgemv_blockscaled.h).\r\n    - Example codes: [example 91](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002F91_fp4_gemv\u002F)\r\n* Support for Blackwell SM100 legacy mixed input GEMM kernels.\r\n    - Collective mainloop codes: [Mixed input mainloop](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Finclude\u002Fcutlass\u002Fgemm\u002Fcollective\u002Fsm100_mma_warpspecialized_mixed_input.hpp).\r\n    - Kerne","2025-09-18T03:32:52",{"id":226,"version":227,"summary_zh":228,"released_at":229},206361,"v4.1.0","**CuTe DSL**\r\n* Add aarch64 support, you can now pip install `nvidia-cutlass-dsl` on GB200 systems!\r\n* More examples demonstrating how to use CuTe DSL to write peak-performance kernels\r\n    - [Blackwell Mamba2 SSD](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002Fpython\u002FCuTeDSL\u002Fblackwell\u002Fmamba2_ssd\u002Fmamba2_ssd.py)\r\n    - [Blackwell SM100 persistent dense blockscaled GEMM with static scheduling](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002Fpython\u002FCuTeDSL\u002Fblackwell\u002Fdense_blockscaled_gemm_persistent.py)\r\n* API updates\r\n    - Please refer to [FUNCTIONALITY.md](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Fblob\u002Fmain\u002FFUNCTIONALITY.md) for details\r\n\r\n**CUTLASS C++**\r\n* Further enhance Blackwell SM100 Attention kernels in [example 77](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002F77_blackwell_fmha\u002F).\r\n    - Add variable sequence length support for FMHA Backward kernel.\r\n    - Add varlen test support to Backward runner.\r\n    - Codes support empty batch sequences.\r\n* Replace `subbyte_iterator` with `cute::recast_ptr` when constructing logical 
iterators\u002Farrays.\r\n* CuTe changes:\r\n    - Rewrite ArithTuple and ScaledBasis for robustness and clarity.\r\n    - Remove buggy and kludgy `get_layoutA|B|C_MN` and friends from Atoms\u002FTiledX.\r\n    - Factor out `print_latex` and friends and rewrite.\r\n    - Factor out `print_svg` and friends and rewrite.\r\n* Support Blackwell SM100 SIMT packed fp32x2 kernels.\r\n* Support residual add for implicit gemm kernels.\r\n* Various fixes for CUTLASS C++ Python interface's EVT tracer:\r\n    - Add verifier for sm90 to report the invalid input.\r\n    - When adding an edge to the graph, if the edge already exists, add an identity compute node to avoid having multiple parallel edges.\r\n    - Register operations of tanh, sigmoid, exp, gelu to the python ast frontend.\r\n    - Replace the NotImplemented Error by packing all nodes into a single topological visitor node as a fallback.\r\n* Fix profiler bugs in exhaustive perf search.\r\n    - Fix incorrect cluster shape output issue when doing exhaustive search.\r\n    - Fix a bug in profiler grouped GEMM for setting tile scheduler swizzles, cluster shapes, and raster orders.\r\n* Fix some profiler issues.\r\n    - Complete the reference for Blackwell blockwise gemm kernels.\r\n    - Fix incorrect regex logic for L1 test.","2025-07-28T03:57:01",{"id":231,"version":232,"summary_zh":233,"released_at":234},206362,"v4.0.0","**CuTe DSL**\r\n\r\nCuTe DSL is a Python DSL centered around CuTe's abstractions\r\n- Enables authoring kernels in Python to reach peak performance on NVIDIA GPUs\r\n- [Core DSL implementation files](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fpython\u002FCuTeDSL)\r\n- [DSL quick start](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Fmedia\u002Fdocs\u002FpythonDSL\u002Fquick_start.html)\r\n- [DSL Overview](https:\u002F\u002Fdocs.nvidia.com\u002Fcutlass\u002Fmedia\u002Fdocs\u002FpythonDSL\u002Foverview.html)\r\n- [Educational notebooks for getting started with CuTe DSL](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002Fpython\u002FCuTeDSL\u002Fnotebooks)\r\n\r\n\r\n**CUTLASS C++**\r\n\r\n- Support [Family Specific Architecture Features](https:\u002F\u002Fdeveloper.nvidia.com\u002Fblog\u002Fnvidia-blackwell-and-nvidia-cuda-12-9-introduce-family-specific-architecture-features\u002F) which was introduced in CUDA 12.9\r\n- Further improved [Blockwise](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002F67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling\u002F67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) and [Groupwise](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002F67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling\u002F67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu) GEMMs on Hopper and Blackwell\r\n- Enhance Blackwell SM100 Attention kernels in [example 77](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002F77_blackwell_fmha\u002F)\r\n- Add [Blackwell SM100 implicit GEMM conv fprop\u002Fdgrad\u002Fwgrad unit tests](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Ftest\u002Funit\u002Fconv\u002Fdevice_3x\u002F)\r\n- New [Hopper SM90 FMHA example](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002F88_hopper_fmha\u002F), similar in design to the existing [Blackwell 
FMHA](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Fexamples\u002F77_blackwell_fmha\u002F)\r\n- Cute enhancements: [CuTe C++ reduce op](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\u002Ftree\u002Fmain\u002Finclude\u002Fcute\u002Falgorithm\u002Ftensor_reduce.hpp) \r\n- Other functional and performance enhancements","2025-06-27T14:17:40",{"id":236,"version":237,"summary_zh":238,"released_at":239},206363,"v3.9.2","* Fixed [Blockwise](.\u002Fexamples\u002F67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling\u002F67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) and [Groupwise](.\u002Fexamples\u002F67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling\u002F67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu) GEMM hang issue when problem size K is 128.\r\n* Optimal code generation with CUDA toolkit versions 12.9.","2025-05-04T04:25:21",{"id":241,"version":242,"summary_zh":243,"released_at":244},206364,"v3.9.1","* Fixed Group Gemm hang issue in CUTLASS 3.x\r\n* Improved Hopper [Blockwise](.\u002Fexamples\u002F67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling\u002F67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) and [Groupwise](.\u002Fexamples\u002F67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling\u002F67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu) GEMM performance.","2025-05-01T04:29:36",{"id":246,"version":247,"summary_zh":248,"released_at":249},206365,"v3.9.0","* Support for Blackwell SM120 kernels for GeForce GPUs in CUTLASS 3.x API:\r\n  - Collective mainloops that target for:\r\n    * [Blockscaled datatypes with support for dense GEMM](.\u002Finclude\u002Fcutlass\u002Fgemm\u002Fcollective\u002Fsm120_blockscaled_mma_tma.hpp)\r\n    * [Blockscaled datatypes with support for sparse GEMM](.\u002Finclude\u002Fcutlass\u002Fgemm\u002Fcollective\u002Fsm120_blockscaled_sparse_mma_tma.hpp)\r\n  - New [GEMM](.\u002Finclude\u002Fcutlass\u002Fgemm\u002Fdispatch_policy.hpp) and [epilogue](.\u002Finclude\u002Fcutlass\u002Fepilogue\u002Fdispatch_policy.hpp) dispatch policies for collectives, kernel layers, and builders.\r\n  - [Blackwell SM120 epilogue](.\u002Finclude\u002Fcutlass\u002Fepilogue\u002Ffusion\u002Fsm120_visitor_store_tma_warpspecialized.hpp) and [full set of EVT fusions](.\u002Finclude\u002Fcutlass\u002Fepilogue\u002Ffusion\u002Fsm120_callbacks_tma_warpspecialized.hpp).\r\n* Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM120 architecture:\r\n  - [Blockscaled GEMM with NVFP4 input datatype and BF16 output tensor](.\u002Fexamples\u002F79_blackwell_geforce_gemm\u002F79a_blackwell_geforce_nvfp4_bf16_gemm.cu).\r\n  - [Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor with scale factor generation](.\u002Fexamples\u002F79_blackwell_geforce_gemm\u002F79b_blackwell_geforce_nvfp4_nvfp4_gemm.cu).\r\n  - [Blockscaled GEMM with mixed input datatype (MXFP8 and MXFP6) and BF16 output tensor](.\u002Fexamples\u002F79_blackwell_geforce_gemm\u002F79c_blackwell_geforce_mixed_mxfp8_mxfp6_bf16_gemm.cu).\r\n  - [Grouped GEMM with nvfp4 datatype](.\u002Fexamples\u002F79_blackwell_geforce_gemm\u002F79d_blackwell_geforce_nvfp4_grouped_gemm.cu).\r\n  - [Sparse Blockscaled GEMM with mxfp8 input datatype and BF16 output tensor](.\u002Fexamples\u002F80_blackwell_geforce_sparse_gemm\u002F80a_blackwell_geforce_mxfp8_bf16_sparse_gemm.cu).\r\n  - [Sparse Blockscaled GEMM with NVFP4 input datatype and NVFP4 output 
tensor](.\u002Fexamples\u002F80_blackwell_geforce_sparse_gemm\u002F80b_blackwell_geforce_nvfp4_nvfp4_sparse_gemm.cu).\r\n* Set of unit tests that demonstrate the usage of both [sparse](.\u002Ftest\u002Funit\u002Fgemm\u002Fdevice\u002Fsm120_blockscaled_sparse_tensorop_gemm\u002F) and [dense](.\u002Ftest\u002Funit\u002Fgemm\u002Fdevice\u002Fsm120_blockscaled_tensorop_gemm\u002F) Blackwell SM120 blockscaled GEMM.\r\n* Support for Blackwell SM100 Sparse kernels:\r\n  - Collective mainloop that target for\r\n    * [SM100 Sparse GEMM](.\u002Finclude\u002Fcutlass\u002Fgemm\u002Fcollective\u002Fsm100_sparse_mma_warpspecialized.hpp)\r\n* Set of example that demonstrate the usage of the 3.x API for targeting Blackwell SM100 Sparse GEMM:\r\n  - [Sparse GEMM](.\u002Fexamples\u002F83_blackwell_sparse_gemm\u002F83_blackwell_sparse_gemm.cu)\r\n  - [Blockscaled Sparse GEMM with NVFP4 input data type](.\u002Fexamples\u002F84_blackwell_narrow_precision_sparse_gemm\u002F84a_blackwell_nvfp4_bf16_sparse_gemm.cu)\r\n  - [Blockscaled Sparse GEMM with mixed input data type (MXFP8 and MXFP4)](.\u002Fexamples\u002F84_blackwell_narrow_precision_sparse_gemm\u002F84b_blackwell_mixed_mxfp8_bf16_sparse_gemm.cu)\r\n* Set of unit tests that demonstrate the usage of [sparse](.\u002Ftest\u002Funit\u002Fgemm\u002Fdevice\u002Fsm100_sparse_tensorop_gemm) and [blockscaled sparse](.\u002Ftest\u002Funit\u002Fgemm\u002Fdevice\u002Fsm100_blockscaled_sparse_tensorop_gemm) Blackwell SM100 GEMM.\r\n* A new Multi-head Latent Attention (MLA) for SM100 Blackwell architecture in CUTLASS [example](.\u002Fexamples\u002F77_blackwell_fmha\u002F) covers the flashMLA-like weight-absorbed decoding use-case.\r\n* A new FMHA Backward kernel for SM100 Blackwell architecture extends CUTLASS [example](.\u002Fexamples\u002F77_blackwell_fmha\u002F) to show how the five backward pass MMAs can be fused into a single kernel to achieve high performance.\r\n* A new [distributed GEMM example](.\u002Fexamples\u002F82_blackwell_distributed_gemm\u002F82_blackwell_distributed_gemm.cu) for SM100 Blackwell architecture.\r\n* Enhancement and new support of block-wise and group-wise GEMM for Hopper and Blackwell architectures:\r\n  - Enhancement of [blockwise GEMM](.\u002Fexamples\u002F67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling\u002F67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) for Hopper architecture.\r\n  - Enhancement of [groupwise GEMM](.\u002Fexamples\u002F67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling\u002F67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu) for Hopper architecture.\r\n  - Support for [grouped GEMM with blockwise and groupwise scaling](.\u002Fexamples\u002F68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling\u002F) for Hopper architecture.\r\n  - Support for [grouped-wise GEMM](.\u002Ftools\u002Fprofiler\u002Fsrc\u002Fblockwise_gemm_operation_profiler.cu) in CUTLASS profiler.\r\n  - Support for [blockwise GEMM](.\u002Fexamples\u002F81_blackwell_gemm_blockwise\u002F81_blackwell_gemm_blockwise.cu) for Blackwell architecture.\r\n  - Support for [groupwise GEMM](.\u002Fexamples\u002F81_blackwell_gemm_blockwise\u002F81_blackwell_gemm_groupwise.cu) for Blackwell architecture.\r\n  - Support for [grouped GEMM with blockwise](.\u002Fexamples\u002F81_blackwell_gemm_blockwise\u002F81_blackwell_grouped_gemm_blockwise.cu) and [groupwise scaling](.\u002Fexamples\u002F81_blackwell_gemm_blockwise\u002F81_blackwell_grouped_gemm_groupwise.cu) for Blackwell architecture.\r\n* Added support for 
enhanced kernel performance search (auto-tuning) in CUTLASS profiler:\r\n  - Sort","2025-04-25T01:53:42",{"id":251,"version":252,"summary_zh":253,"released_at":254},206366,"v3.8.0","\r\nCUTLASS 3.8 is the first release that supports the NVIDIA Blackwell SM100 architecture.\r\nFor a background on Blackwell's new features, please consult the PTX documentation for CUDA 12.8.\r\n\r\n* Support for new CuTe building blocks specifically for Blackwell SM100 architecture:\r\n  - [5th generation Blackwell Tensor Core instructions (TCGen05)](.\u002Finclude\u002Fcute\u002Fatom\u002Fmma_traits_sm100.hpp) via CuTe MMA atoms.\r\n  - Extensions to [Tensor Memory Accelerator](.\u002Finclude\u002Fcute\u002Fatom\u002Fcopy_traits_sm100_tma.hpp) via CuTe Copy atoms.\r\n  - Exposure of Blackwell's new tensor memory (note: distinct from TMA) as [`tmem`](.\u002Finclude\u002Fcute\u002Fpointer.hpp) across CuTe as a first class data locale.\r\n  - Exposure of [`tmem->rmem`, `rmem->tmem` and `smem->tmem data movement instructions`](.\u002Finclude\u002Fcute\u002Fatom\u002Fcopy_traits_sm100.hpp) as copy atoms in CuTe.\r\n  - [`make_tmem_copy()`](.\u002Finclude\u002Fcute\u002Fatom\u002Fcopy_traits_sm100.hpp) utility method to ease creation of tiled copies for tmem copy atoms.\r\n  - Support for [new variants of LDSM on Blackwell](.\u002Finclude\u002Fcute\u002Fatom\u002Fcopy_traits_sm100.hpp) via CuTe Copy atoms.\r\n* Support for new CUTLASS building blocks specifically for Blackwell SM100 architecture:\r\n  - Various narrow precision [FP4, FP6, and FP8](.\u002Finclude\u002Fcutlass\u002Fexmy_base.h) formats as well as their [block-scaled variants NVFP4, MXFP4, MXFP6, and MXFP8](.\u002Finclude\u002Fcutlass\u002Ffloat_subbyte.h)\r\n  - [Pipelines that implement Blackwell specific synchronization](.\u002Finclude\u002Fcutlass\u002Fpipeline\u002Fsm100_pipeline.hpp).\r\n  - [Cluster launch control API supporting preferred and fallback cluster shapes](.\u002Finclude\u002Fcutlass\u002Fcluster_launch.hpp).\r\n  - Data types including NVFP4, MXFP4, MXFP6, and MXFP8 and all their supported element and scale factor types.\r\n  - Tile schedulers using [Blackwell's Cluster Launch Control (CLC) feature](.\u002Fmedia\u002Fdocs\u002Fblackwell_cluster_launch_control.md) to implement dynamic persistence scheduling for [GEMMs](.\u002Finclude\u002Fcutlass\u002Fgemm\u002Fkernel\u002Fsm100_tile_scheduler.hpp), and [stream-K](.\u002Finclude\u002Fcutlass\u002Fgemm\u002Fkernel\u002Fsm100_tile_scheduler_stream_k.hpp).\r\n  - Extensions to testbeds and reference check code for unit tests and CUTLASS profiler.\r\n* Full support for Blackwell SM100 kernels in CUTLASS 3.x API:\r\n  - [Blackwell specific kernel layers](.\u002Finclude\u002Fcutlass\u002Fgemm\u002Fkernel\u002Fsm100_gemm_tma_warpspecialized.hpp) that\r\n    + Implement a new warp-specialization recipe tuned specifically for Blackwell SM100 architecture.\r\n    + Leverage all the new features such as CLC based tile scheduling, preferred cluster, and TMEM based double buffering of accumulators.\r\n    + Support stream-K load balancing for all kernel types everywhere via composable scheduler support.\r\n  - Blackwell collective mainloops that target the TCGen05 MMA instructions (both SS and TS) for\r\n    * [Non-block scaled data types without support for pointer array and grouped GEMM with TMA](.\u002Finclude\u002Fcutlass\u002Fgemm\u002Fcollective\u002Fsm100_mma_warpspecialized.hpp)\r\n    * [Non-block scaled data types with support for pointer array and grouped GEMM with 
TMA](.\u002Finclude\u002Fcutlass\u002Fgemm\u002Fcollective\u002Fsm100_mma_array_warpspecialized.hpp)\r\n    * [Block scaled data types without support for pointer array and grouped GEMM with TMA](.\u002Finclude\u002Fcutlass\u002Fgemm\u002Fcollective\u002Fsm100_blockscaled_mma_warpspecialized.hpp)\r\n    * [Block scaled data types with support for pointer array and grouped GEMM with TMA](.\u002Finclude\u002Fcutlass\u002Fgemm\u002Fcollective\u002Fsm100_blockscaled_mma_array_warpspecialized.hpp)\r\n  - Blackwell [collective mainloop for convolution kernels](.\u002Finclude\u002Fcutlass\u002Fconv\u002Fcollective\u002Fsm100_implicit_gemm_umma_warpspecialized.hpp) supporting non-block scaled data types for fprop, dgrad, and wgrad.\r\n  - New [GEMM](.\u002Finclude\u002Fcutlass\u002Fgemm\u002Fdispatch_policy.hpp), [convolution](.\u002Finclude\u002Fcutlass\u002Fconv\u002Fdispatch_policy.hpp), and [epilogue](.\u002Finclude\u002Fcutlass\u002Fepilogue\u002Fdispatch_policy.hpp) dispatch policies for collectives, kernel layers, and builders.\r\n  - [Blackwell epilogue that supports loading accumulators from `tmem`](.\u002Finclude\u002Fcutlass\u002Fepilogue\u002Fcollective\u002Fsm100_epilogue_tma_warpspecialized.hpp) and [full set of EVT fusions]().\r\n* CUTLASS library and profiler integration for block scaled data types for kernel emission, profiling, and verification.\r\n  - Support for preferred and fallback cluster shapes via profiler command line arguments parsing to set dynamic cluster shapes.\r\n  - Support for dynamic datatypes by parsing profiler via profiler command line arguments parsing to set dynamic datatype setting in TCGen05 MMA instruction descriptors.\r\n  - Support for mixed input GEMM kernels on Hopper in the profiler.\r\n* New CUTLASS profiler flag `use-cuda-graphs` to reduce overheads when benchmarking launch-bound kernels.\r\n* A new 3.x version of grouped GEMM to the CUTLASS library and generates kernels for Hopper and Blackwell. 
Now grouped GEMM support is enabled in the CUTLASS profiler (`.\u002Fcutlass_profiler --operation=GroupedGemm --help` for details).\r\n* Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM100 architec","2025-02-21T05:32:15",{"id":256,"version":257,"summary_zh":258,"released_at":259},206367,"v3.7.0","\r\n- A new [Hopper blockwise scaling FP8 GEMM](.\u002Fexamples\u002F67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling\u002F67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) where the operands and block scaling tensor are staged via shared memory.\r\n- [Distributed GEMM](.\u002Fexamples\u002F65_distributed_gemm\u002F65_distributed_gemm.cu) is an experimental pipelined Tensor Parallelism implementation utilizing existing CUTLASS kernels and CUDA runtime features, which can hide the most of communication behind computation.\r\n- Improved persistent grid launch for Hopper kernels with large cluster sizes (>= size of 4) using the new `make_kernel_hardware_info` API as shown in [example 48](.\u002Fexamples\u002F48_hopper_warp_specialized_gemm\u002F48_hopper_warp_specialized_gemm.cu).\r\n- Enabled high precision accumulation for Hopper FP8 Sparse GEMM.","2025-01-18T15:07:29",{"id":261,"version":262,"summary_zh":263,"released_at":264},206368,"v3.6.0","- [Hopper structured sparse GEMM](.\u002Fexamples\u002F62_hopper_sparse_gemm\u002F62_hopper_sparse_gemm.cu).\r\n  + [FP16](.\u002Ftest\u002Funit\u002Fgemm\u002Fdevice\u002Fsm90_sparse_gemm_f16_f16_f32_tensor_op_f32.cu)\r\n  + [FP8](.\u002Ftest\u002Funit\u002Fgemm\u002Fdevice\u002Fsm90_sparse_gemm_f8_f8_f32_tensor_op_f32.cu)\r\n  + [INT8](.\u002Ftest\u002Funit\u002Fgemm\u002Fdevice\u002Fsm90_sparse_gemm_s8_s8_s32_tensor_op_s32.cu)\r\n  + [TF32](.\u002Ftest\u002Funit\u002Fgemm\u002Fdevice\u002Fsm90_sparse_gemm_tf32_tf32_f32_tensor_op_f32.cu)\r\n- A refactor to the CUTLASS 3.x convolution `kernel::ConvUniversal` [API](.\u002Finclude\u002Fcutlass\u002Fconv\u002Fkernel\u002Fsm90_implicit_gemm_tma_warpspecialized.hpp) to bring it in line with `gemm::GemmUniversal`. Now the 3.x convolution API is no longer considered as a beta API.\r\n- [An improved mixed input GEMM](.\u002Fexamples\u002F55_hopper_mixed_dtype_gemm\u002FREADME.md) and a [lookup table implementation](.\u002Fexamples\u002F55_hopper_mixed_dtype_gemm\u002F55_hopper_int4_fp8_gemm.cu) for `INT4`x`FP8` scale-only mode.\r\n- [EVT nodes for Top-K selection and softmax](.\u002Finclude\u002Fcutlass\u002Fepilogue\u002Ffusion\u002Fsm90_visitor_topk_softmax.hpp) and [GEMM example using those](.\u002Fexamples\u002F61_hopper_gemm_with_topk_and_softmax\u002F61_hopper_gemm_with_topk_and_softmax.cu).\r\n- [Programmatic Dependent Launch](.\u002Finclude\u002Fcutlass\u002Farch\u002Fgrid_dependency_control.h) (PDL) that leverages a new Hopper feature to speedup two back-to-back kernels, and its corresponding [documentations](.\u002Fmedia\u002Fdocs\u002Fdependent_kernel_launch.md).\r\n- [A new debugging tool, synclog](.\u002Finclude\u002Fcutlass\u002Farch\u002Fsynclog.hpp), for dumping out all synchronization events from within a kernel to a file. 
Please see [synclog documentation](.\u002Fmedia\u002Fdocs\u002Futilities.md#debugging-asynchronous-kernels-with-cutlasss-built-in-synclog-tool) for details.\r\n- A new TMA-enabled [epilogue](.\u002Finclude\u002Fcutlass\u002Fepilogue\u002Fcollective\u002Fsm90_epilogue_array_tma_warpspecialized.hpp) for grouped GEMM that brings significant performance improvement, as well as its EVT support.\r\n- A SIMT-enabled pointer-array [epilogue](.\u002Finclude\u002Fcutlass\u002Fepilogue\u002Fcollective\u002Fsm70_epilogue_vectorized_array.hpp).\r\n- A new [Ping-Pong kernel schedule for Grouped GEMM](.\u002Finclude\u002Fcutlass\u002Fgemm\u002Fkernel\u002Fsm90_gemm_array_tma_warpspecialized_pingpong.hpp) and some other optimizations.\r\n- [A new instantiation strategy for CUTLASS profiler kernels](.\u002Fpython\u002Fcutlass_library\u002Fsm90_shapes.py) along with [improved documentation for instantiation level in CUTLASS profiler](.\u002Fmedia\u002Fdocs\u002Fprofiler.md#instantiating-more-kernels-with-hopper).\r\n- A new hardware support for comparisons and computations of [`cutlass::bfloat16_t`](.\u002Finclude\u002Fcutlass\u002Fbfloat16.h)\r\n- Fixed use of isnan on Windows for [`half_t`](.\u002Ftest\u002Funit\u002Fcore\u002Ffunctional.cu).","2024-12-25T22:19:24",{"id":266,"version":267,"summary_zh":268,"released_at":269},206369,"v3.5.1","- [Minimal SM90 WGMMA + TMA GEMM example in 100 lines of code](.\u002Fexamples\u002Fcute\u002Ftutorial\u002Fwgmma_sm90.cu).\r\n- [Exposure of L2 `cache_hint`s in TMA copy atoms](.\u002Finclude\u002Fcute\u002Farch\u002Fcopy_sm90_tma.hpp#L48)\r\n- Exposure of raster order and tile swizzle extent in [CUTLASS library profiler](.\u002Fmedia\u002Fdocs\u002Fprofiler.md#GEMM), and\r\n[example 48](.\u002Fexamples\u002F48_hopper_warp_specialized_gemm\u002F48_hopper_warp_specialized_gemm.cu).\r\n- [TMA store based and EVT supported epilogues](.\u002Finclude\u002Fcutlass\u002Fepilogue\u002Fcollective\u002Fsm90_epilogue_array_tma_warpspecialized.hpp) for [Hopper pointer array batched kernels](.\u002Ftest\u002Funit\u002Fgemm\u002Fdevice\u002Fsm90_gemm_f16_f16_f16_tensor_op_f32_ptr_array.cu).\r\n- A new [`GemmSparseUniversal` API for CUTLASS 2.x Ampere kernels](.\u002Finclude\u002Fcutlass\u002Fgemm\u002Fdevice\u002Fgemm_sparse_universal.h) to enable serial and parallel split-k for sparse tensor cores and new tiny tile sizes to better support LLM inference.\r\n- [CUDA host adapter](.\u002Finclude\u002Fcutlass\u002Fcuda_host_adapter.hpp) extensions to support TMA descriptor construction driver APIs.\r\n- Inclusion of more [Hopper fprop, dgrad, and wgrad convolution kernels in CUTLASS library and profiler](.\u002Fpython\u002Fcutlass_library\u002Fgenerator.py).\r\n- Support for residual add (beta != 0) in convolution kernels.\r\n- A new convolution [epilogue](.\u002Fexamples\u002F16_ampere_tensorop_conv2dfprop\u002Fampere_tensorop_conv2dfprop.cu#L269) for CUTLASS 2.x to support non-packed NHWC output.\r\n- A refactor of [include files throughout CUTLASS core directories](.\u002Finclude\u002Fcutlass\u002Fgemm\u002Fcollective\u002Fcollective_mma_decl.hpp) to reduce circular dependencies and [tests to guard against them](.\u002Ftest\u002Fself_contained_includes\u002FCMakeLists.txt).\r\n- [A guide for setting up VSCode to work well with CUTLASS](.\u002Fmedia\u002Fdocs\u002Fide_setup.md) and [expanded code style guide](.\u002Fmedia\u002Fdocs\u002Fprogramming_guidelines.md).\r\n- Better support for MSVC as a host compiler.\r\n- Many performance optimizations, improvements, and bug fixes 
including fixes for FlashAttention-2.\r\n- Optimal code generation with CUDA toolkit versions 12.4 and 12.5u1.\r\n- NOTICE:\r\n  + Upcoming CUTLASS 3.6 release will include a breaking refactor to the CUTLASS 3.x convolution `kernel::ConvUniversal` API to bring it in line with `gemm::GemmUniversal`. After this, the 3.x convolution API will no longer be considered as a beta API.\r\n  + Upcoming CUTLASS 3.6 release will include a breaking refactor to the Hopper TMA pointer array batched epilogue in order to support grouped GEMMs.\r\n","2024-08-29T20:15:44"]