[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-NVIDIA--cccl":3,"tool-NVIDIA--cccl":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":79,"owner_twitter":79,"owner_website":80,"owner_url":81,"languages":82,"stars":115,"forks":116,"last_commit_at":117,"license":118,"difficulty_score":10,"env_os":119,"env_gpu":120,"env_ram":121,"env_deps":122,"category_tags":128,"github_topics":129,"view_count":10,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":149,"updated_at":150,"faqs":151,"releases":180},420,"NVIDIA\u002Fcccl","cccl","CUDA Core Compute Libraries","cccl（CUDA Core Compute Libraries）是 NVIDIA 推出的一个开源项目，将 Thrust、CUB 和 libcudacxx 三个核心 CUDA C++ 库整合到统一代码库中。它为开发者提供了一套高效、安全且易用的并行计算构建模块：Thrust 提供类似标准库的高级并行算法，CUB 提供面向 GPU 的底层高性能原语（如块级规约），而 libcudacxx 则实现了可在 GPU 设备代码中使用的 C++ 标准库功能及 CUDA 特有硬件特性支持。\n\ncccl 解决了以往三个库分散维护带来的开发和集成复杂性问题，通过统一管理提升兼容性与开发效率，帮助开发者更专注于业务逻辑而非底层细节。它特别适合使用 CUDA 进行高性能计算的 C++ 开发者和研究人员，尤其是需要在 GPU 上实现高效并行算法或自定义核函数的用户。其亮点在于融合高层抽象与底层控制能力，在保持性能的同时提升代码可读性和可移植性。","[![Open in GitHub Codespaces](https:\u002F\u002Fgithub.com\u002Fcodespaces\u002Fbadge.svg)](https:\u002F\u002Fcodespaces.new\u002FNVIDIA\u002Fcccl?quickstart=1&devcontainer_path=.devcontainer%2Fdevcontainer.json)\n\n|[Contributor 
Guide](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fblob\u002Fmain\u002FCONTRIBUTING.md)|[Dev Containers](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fblob\u002Fmain\u002F.devcontainer\u002FREADME.md)|[Discord](https:\u002F\u002Fdiscord.gg\u002Fnvidiadeveloper)|[Godbolt](https:\u002F\u002Fgodbolt.org\u002Fz\u002Fx4G73af9a)|[GitHub Project](https:\u002F\u002Fgithub.com\u002Forgs\u002FNVIDIA\u002Fprojects\u002F6)|[Documentation](https:\u002F\u002Fnvidia.github.io\u002Fcccl)|\n|-|-|-|-|-|-|\n\n# CUDA Core Compute Libraries (CCCL)\n\nWelcome to the CUDA Core Compute Libraries (CCCL) where our mission is to make CUDA more delightful.\n\nThis repository unifies three essential CUDA C++ libraries into a single, convenient repository:\n\n- [Thrust](thrust) ([former repo](https:\u002F\u002Fgithub.com\u002Fnvidia\u002Fthrust))\n- [CUB](cub) ([former repo](https:\u002F\u002Fgithub.com\u002Fnvidia\u002Fcub))\n- [libcudacxx](libcudacxx) ([former repo](https:\u002F\u002Fgithub.com\u002Fnvidia\u002Flibcudacxx))\n\nThe goal of CCCL is to provide CUDA C++ developers with building blocks that make it easier to write safe and efficient code.\nBringing these libraries together streamlines your development process and broadens your ability to leverage the power of CUDA C++.\nFor more information about the decision to unify these projects, see the [announcement here](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fdiscussions\u002F520).\n\n## Overview\n\nThe concept for the CUDA Core Compute Libraries (CCCL) grew organically out of the Thrust, CUB, and libcudacxx projects that were developed independently over the years with a similar goal: to provide high-quality, high-performance, and easy-to-use C++ abstractions for CUDA developers.\nNaturally, there was a lot of overlap among the three projects, and it became clear the community would be better served by unifying them into a single repository.\n\n- **Thrust** is the C++ parallel algorithms library which 
inspired the introduction of parallel algorithms to the C++ Standard Library. Thrust's high-level interface greatly enhances programmer productivity while enabling performance portability between GPUs and multicore CPUs via configurable backends that allow using multiple parallel programming frameworks (such as CUDA, TBB, and OpenMP).\n\n- **CUB** is a lower-level, CUDA-specific library designed for speed-of-light parallel algorithms across all GPU architectures. In addition to device-wide algorithms, it provides *cooperative algorithms* like block-wide reduction and warp-wide scan, providing CUDA kernel developers with building blocks to create speed-of-light, custom kernels.\n\n- **libcudacxx** is the CUDA C++ Standard Library. It provides an implementation of the C++ Standard Library that works in both host and device code. Additionally, it provides abstractions for CUDA-specific hardware features like synchronization primitives, cache control, atomics, and more.\n\nThe main goal of CCCL is to fill a similar role that the Standard C++ Library fills for Standard C++: provide general-purpose, speed-of-light tools to CUDA C++ developers, allowing them to focus on solving the problems that matter.\nUnifying these projects is the first step towards realizing that goal.\n\n## Example\n\nThis is a simple example demonstrating the use of CCCL functionality from Thrust, CUB, and libcudacxx.\n\nIt shows how to use Thrust\u002FCUB\u002Flibcudacxx to implement a simple parallel reduction kernel.\nEach thread block computes the sum of a subset of the array using `cub::BlockReduce`.\nThe sum of each block is then reduced to a single value using an atomic add via `cuda::atomic_ref` from libcudacxx.\n\nIt then shows how the same reduction can be done using Thrust's `reduce` algorithm and compares the results.\n\n[Try it live on Godbolt!](https:\u002F\u002Fgodbolt.org\u002Fz\u002F3KaWz3Msf)\n\n```cpp\n#include \u003Cthrust\u002Fexecution_policy.h>\n#include 
\u003Cthrust\u002Fdevice_vector.h>\n#include \u003Ccub\u002Fblock\u002Fblock_reduce.cuh>\n#include \u003Ccuda\u002Fatomic>\n#include \u003Ccuda\u002Fcmath>\n#include \u003Ccuda\u002Fstd\u002Fspan>\n#include \u003Ccassert>\n#include \u003Ccstdio>\n\ntemplate \u003Cint block_size>\n__global__ void reduce(cuda::std::span\u003Cint const> data, cuda::std::span\u003Cint> result) {\n  using BlockReduce = cub::BlockReduce\u003Cint, block_size>;\n  __shared__ typename BlockReduce::TempStorage temp_storage;\n\n  int const index = threadIdx.x + blockIdx.x * blockDim.x;\n  int sum = 0;\n  if (index \u003C data.size()) {\n    sum += data[index];\n  }\n  sum = BlockReduce(temp_storage).Sum(sum);\n\n  if (threadIdx.x == 0) {\n    cuda::atomic_ref\u003Cint, cuda::thread_scope_device> atomic_result(result.front());\n    atomic_result.fetch_add(sum, cuda::memory_order_relaxed);\n  }\n}\n\nint main() {\n\n  \u002F\u002F Allocate and initialize input data\n  int const N = 1000;\n  thrust::device_vector\u003Cint> data(N);\n  thrust::fill(data.begin(), data.end(), 1);\n\n  \u002F\u002F Allocate output data\n  thrust::device_vector\u003Cint> kernel_result(1);\n\n  \u002F\u002F Compute the sum reduction of `data` using a custom kernel\n  constexpr int block_size = 256;\n  int const num_blocks = cuda::ceil_div(N, block_size);\n  reduce\u003Cblock_size>\u003C\u003C\u003Cnum_blocks, block_size>>>(cuda::std::span\u003Cint const>(thrust::raw_pointer_cast(data.data()), data.size()),\n                                                 cuda::std::span\u003Cint>(thrust::raw_pointer_cast(kernel_result.data()), 1));\n\n  auto const err = cudaDeviceSynchronize();\n  if (err != cudaSuccess) {\n    std::printf(\"Error: %s\\n\", cudaGetErrorString(err));\n    return -1;\n  }\n\n  int const custom_result = kernel_result[0];\n\n  \u002F\u002F Compute the same sum reduction using Thrust\n  int const thrust_result = thrust::reduce(thrust::device, data.begin(), data.end(), 0);\n\n  \u002F\u002F 
Ensure the two solutions are identical\n  std::printf(\"Custom kernel sum: %d\\n\", custom_result);\n  std::printf(\"Thrust reduce sum: %d\\n\", thrust_result);\n  assert(kernel_result[0] == thrust_result);\n  return 0;\n}\n```\n\n## Getting Started\n\n### Users\n\nEverything in CCCL is header-only.\nTherefore, users need only concern themselves with how they get the header files and how they incorporate them into their build system.\n\n#### CUDA Toolkit\nThe easiest way to get started using CCCL is via the [CUDA Toolkit](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-toolkit) which includes the CCCL headers.\nWhen you compile with `nvcc`, it automatically adds CCCL headers to your include path so you can simply `#include` any CCCL header in your code with no additional configuration required.\n\nIf compiling with another compiler, you will need to update your build system's include search path to point to the CCCL headers in your CTK install (e.g., `\u002Fusr\u002Flocal\u002Fcuda\u002Finclude`).\n\n```cpp\n#include \u003Cthrust\u002Fdevice_vector.h>\n#include \u003Ccub\u002Fcub.cuh>\n#include \u003Ccuda\u002Fstd\u002Fatomic>\n```\n\n#### GitHub\n\nUsers who want to stay on the cutting edge of CCCL development are encouraged to use CCCL from GitHub.\nUsing a newer version of CCCL with an older version of the CUDA Toolkit is supported, but not the other way around.\nFor complete information on compatibility between CCCL and the CUDA Toolkit, see [our platform support](#platform-support).\n\nEverything in CCCL is header-only, so cloning and including it in a simple project is as easy as the following:\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl.git\nnvcc -Icccl\u002Fthrust -Icccl\u002Flibcudacxx\u002Finclude -Icccl\u002Fcub main.cu -o main\n```\n> **Note**\n> Use `-I` and not `-isystem` to avoid collisions with the CCCL headers implicitly included by `nvcc` from the CUDA Toolkit. 
All CCCL headers use `#pragma system_header` to ensure warnings will still be silenced as if using `-isystem`, see https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fissues\u002F527 for more information.\n\n##### Installation\n\nThe default CMake options generate only installation rules, so the familiar\n`cmake . && make install` workflow just works:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl.git\ncd cccl\ncmake . -DCMAKE_INSTALL_PREFIX=\u002Fusr\u002Flocal\nmake install\n```\n\nA convenience script is also provided:\n\n```bash\nci\u002Finstall_cccl.sh \u002Fusr\u002Flocal\n```\n\n###### Advanced installation using presets\n\nCMake presets are also available with options for including experimental\nlibraries:\n\n```bash\ncmake --preset install -DCMAKE_INSTALL_PREFIX=\u002Fusr\u002Flocal\ncmake --build --preset install --target install\n```\n\nUse the `install-unstable` preset to include experimental libraries, or\n`install-unstable-only` to install only experimental libraries.\n\n#### Conda\n\nCCCL also provides conda packages of each release via the `conda-forge` channel:\n\n```bash\nconda config --add channels conda-forge\nconda install cccl\n```\n\nThis will install the latest CCCL to the conda environment's `$CONDA_PREFIX\u002Finclude\u002F` and `$CONDA_PREFIX\u002Flib\u002Fcmake\u002F` directories.\nIt is discoverable by CMake via `find_package(CCCL)` and can be used by any compilers in the conda environment.\nFor more information, see [this introduction to conda-forge](https:\u002F\u002Fconda-forge.org\u002Fdocs\u002Fuser\u002Fintroduction\u002F).\n\nIf you want to use the same CCCL version that shipped with a particular CUDA Toolkit, e.g. 
CUDA 12.4, you can install CCCL with:\n\n```bash\nconda config --add channels conda-forge\nconda install cuda-cccl cuda-version=12.4\n```\n\nThe `cuda-cccl` metapackage installs the `cccl` version that shipped with the CUDA Toolkit corresponding to `cuda-version`.\nIf you wish to update to the latest `cccl` after installing `cuda-cccl`, uninstall `cuda-cccl` before updating `cccl`:\n\n```bash\nconda uninstall cuda-cccl\nconda install -c conda-forge cccl\n```\n\n> **Note**\n> There are also conda packages with names like `cuda-cccl_linux-64`.\n> Those packages contain the CCCL versions shipped as part of the CUDA Toolkit, but are designed for internal use by the CUDA Toolkit.\n> Install `cccl` or `cuda-cccl` instead, for compatibility with conda compilers.\n> For more information, see the [cccl conda-forge recipe](https:\u002F\u002Fgithub.com\u002Fconda-forge\u002Fcccl-feedstock\u002Fblob\u002Fmain\u002Frecipe\u002Fmeta.yaml).\n\n##### CMake Integration\n\nCCCL uses [CMake](https:\u002F\u002Fcmake.org\u002F) for all build and installation infrastructure, including tests as well as targets to link against in other CMake projects.\nTherefore, CMake is the recommended way to integrate CCCL into another project.\n\nFor a complete example of how to do this using CMake Package Manager see [our basic example project](examples\u002Fbasic).\n\nOther build systems should work, but only CMake is tested.\nContributions to simplify integrating CCCL into other build systems are welcome.\n\n### Contributors\n\nInterested in contributing to making CCCL better? 
Check out our [Contributing Guide](CONTRIBUTING.md) for a comprehensive overview of everything you need to know to set up your development environment, make changes, run tests, and submit a PR.\n\n## Platform Support\n\n**Objective:** This section describes where users can expect CCCL to compile and run successfully.\n\nIn general, CCCL should work everywhere the CUDA Toolkit is supported; however, the devil is in the details.\nThe sections below describe the details of support and testing for different versions of the CUDA Toolkit, host compilers, and C++ dialects.\n\n### CUDA Toolkit (CTK) Compatibility\n\n**Summary:**\n- The latest version of CCCL is backward compatible with the current and preceding CTK major version series\n- CCCL is never forward compatible with any version of the CTK. Always use a CCCL version that is the same as or newer than the one included with your CTK.\n- Minor version CCCL upgrades won't break existing code, but new features may not support all CTK versions\n\nCCCL users are encouraged to capitalize on the latest enhancements and [\"live at head\"](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=tISy7EJQPzI) by always using the newest version of CCCL.\nFor a seamless experience, you can upgrade CCCL independently of the entire CUDA Toolkit.\nThis is possible because CCCL maintains backward compatibility with the latest patch release of every minor CTK release from both the current and previous major version series.\nIn some exceptional cases, the minimum supported minor version of the CUDA Toolkit release may need to be newer than the oldest release within its major version series.\n\nWhen a new major CTK is released, we drop support for the oldest supported major version.\n\n| CCCL Version | Supports CUDA Toolkit Version                  |\n|--------------|------------------------------------------------|\n| 2.x          | 11.1 - 11.8, 12.x (only latest patch releases) |\n| 3.x          | 12.x, 13.x  (only latest patch releases)       |\n\n[Well-behaved 
code](#compatibility-guidelines) using the latest CCCL should compile and run successfully with any supported CTK version.\nExceptions may occur for new features that depend on new CTK features, so those features would not work on older versions of the CTK.\n\nUsers can integrate a newer version of CCCL into an older CTK, but not the other way around.\nThis means an older version of CCCL is not compatible with a newer CTK.\nIn other words, **CCCL is never forward compatible with the CUDA Toolkit.**\n\nThe table below summarizes compatibility of the CTK and CCCL:\n\n| CTK Version | Included CCCL Version |    Desired CCCL     | Supported? |                           Notes                           |\n|:-----------:|:---------------------:|:--------------------:|:----------:|:--------------------------------------------------------:|\n|  CTK `X.Y`  |  CCCL `MAJOR.MINOR`   | CCCL `MAJOR.MINOR+n` |    ✅     |            Some new features might not work              |\n|  CTK `X.Y`  |  CCCL `MAJOR.MINOR`   | CCCL `MAJOR+1.MINOR` |    ✅     | Possible breaks; some new features might not be available|\n|  CTK `X.Y`  |  CCCL `MAJOR.MINOR`   | CCCL `MAJOR+2.MINOR` |    ❌     |    CCCL supports only two CTK major versions             |\n|  CTK `X.Y`  |  CCCL `MAJOR.MINOR`   | CCCL `MAJOR.MINOR-n` |    ❌     |          CCCL isn't forward compatible                   |\n|  CTK `X.Y`  |  CCCL `MAJOR.MINOR`   | CCCL `MAJOR-n.MINOR` |    ❌     |          CCCL isn't forward compatible                   |\n\nFor more information on CCCL versioning, API\u002FABI compatibility, and breaking changes see the [Versioning](#versioning) section below.\n\n### Operating Systems\n\nUnless otherwise specified, CCCL supports all the same operating systems as the CUDA Toolkit, which are documented here:\n - [Linux](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcuda-installation-guide-linux\u002Findex.html#system-requirements)\n - 
[Windows](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcuda-installation-guide-microsoft-windows\u002Findex.html#system-requirements)\n\n### Host Compilers\n\nUnless otherwise specified, CCCL supports the same host compilers as the latest CUDA Toolkit, which are documented here:\n- [Linux](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcuda-installation-guide-linux\u002Findex.html#host-compiler-support-policy)\n- [Windows](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcuda-installation-guide-microsoft-windows\u002Findex.html#system-requirements)\n\nFor GCC on Linux, at least 7.x is required.\n\nWhen using older CUDA Toolkits, we also only support the host compilers of the latest CUDA Toolkit,\nbut at least the most recent host compiler of any supported older CUDA Toolkit.\n\nWe may retain support of additional compilers and will accept corresponding patches from the community with reasonable fixes.\nBut we will not invest significant time in triaging or fixing issues for older compilers.\n\nIn the spirit of \"You only support what you test\", see our [CI Overview](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fblob\u002Fmain\u002Fci-overview.md) for more information on exactly what we test.\n\n### GPU Architectures\n\nCCCL supports all GPU architectures that are [supported by the *current* major CUDA Toolkit (CTK)](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-gpus).\n\nTo be clear, while CCCL can be compiled with both the current and previous CTK major versions, we do not test or validate architectures that were only supported in the older CTK.\n\nThose architectures may still work — we do not intentionally break them — but they are outside our regular CI coverage. 
Furthermore, new features are not guaranteed to work with these architectures either.\n\nWe welcome community contributions for reasonable fixes that unblock users on these older architectures.\n\nFor example, CCCL 3.0 supports compiling with CTK 12.x and 13.x where\n- CUDA Toolkit 13.x supports `>=sm_75`\n- CUDA Toolkit 12.x supports `>=sm_50`\n\nIn this scenario, compiling CCCL 3.0 with CTK 12.x targeting architectures below `sm_75` may work, but those configurations are not part of our regular testing.\n\n### C++ Dialects\n- C++17\n- C++20\n\n### Testing Strategy\n\nCCCL's testing strategy strikes a balance between testing as many configurations as possible and maintaining reasonable CI times.\n\nFor CUDA Toolkit versions, testing is done against both the oldest and the newest supported versions.\nFor instance, if the latest version of the CUDA Toolkit is 12.6, tests are conducted against 11.1 and 12.6.\nFor each CUDA version, builds are completed against all supported host compilers with all supported C++ dialects.\n\nThe testing strategy and matrix are constantly evolving.\nThe matrix defined in the [`ci\u002Fmatrix.yaml`](ci\u002Fmatrix.yaml) file is the definitive source of truth.\nFor more information about our CI pipeline, see [here](ci-overview.md).\n\n## Versioning\n\n**Objective:** This section describes how CCCL is versioned, API\u002FABI stability guarantees, and compatibility guidelines to minimize upgrade headaches.\n\n**Summary**\n- The entirety of CCCL's API shares a common semantic version across all components\n- Only the most recently released version is supported and fixes are not backported to prior releases\n- API breaking changes and incrementing CCCL's major version will only coincide with a new major version release of the CUDA Toolkit\n- Not all source breaking changes are considered breaking changes of the public API that warrant bumping the major version number\n- Do not rely on ABI stability of entities in the `cub::` or `thrust::` 
namespaces\n- ABI breaking changes for symbols in the `cuda::` namespace may happen at any time, but will be reflected by incrementing the ABI version which is embedded in an inline namespace for all `cuda::` symbols. Multiple ABI versions may be supported concurrently.\n\n**Note:** Prior to merging Thrust, CUB, and libcudacxx into this repository, each library was independently versioned according to semantic versioning.\nStarting with the 2.1 release, all three libraries synchronized their release versions in their separate repositories.\nMoving forward, CCCL will continue to be released under a single [semantic version](https:\u002F\u002Fsemver.org\u002F), with 2.2.0 being the first release from the [nvidia\u002Fcccl](https:\u002F\u002Fgithub.com\u002Fnvidia\u002Fcccl) repository.\n\n### Breaking Change\n\nA Breaking Change is a change to **explicitly supported** functionality between released versions that would require a user to do work in order to upgrade to the newer version.\n\nIn the limit, [_any_ change](https:\u002F\u002Fwww.hyrumslaw.com\u002F) has the potential to break someone somewhere.\nAs a result, not all possible source breaking changes are considered Breaking Changes to the public API that warrant bumping the major semantic version.\n\nThe sections below describe the details of breaking changes to CCCL's API and ABI.\n\n### Application Programming Interface (API)\n\nCCCL's public API is the entirety of the functionality _intentionally_ exposed to provide the utility of the library.\n\nIn other words, CCCL's public API goes beyond just function signatures and includes (but is not limited to):\n- The location and names of headers intended for direct inclusion in user code\n- The namespaces intended for direct use in user code\n- The declarations and\u002For definitions of functions, classes, and variables located in headers and intended for direct use in user code\n- The semantics of functions, classes, and variables intended for direct use in user 
code\n\nMoreover, CCCL's public API does **not** include any of the following:\n- Any symbol prefixed with `_` or `__`\n- Any symbol whose name contains `detail` including the `detail::` namespace or a macro\n- Any header file contained in a `detail\u002F` directory or sub-directory thereof\n- The header files implicitly included by any header part of the public API\n\nIn general, the goal is to avoid breaking anything in the public API.\nSuch changes are made only if they offer users better performance, easier-to-understand APIs, and\u002For more consistent APIs.\n\nAny breaking change to the public API will require bumping CCCL's major version number.\nIn keeping with [CUDA Minor Version Compatibility](https:\u002F\u002Fdocs.nvidia.com\u002Fdeploy\u002Fcuda-compatibility\u002F#minor-version-compatibility),\nAPI breaking changes and CCCL major version bumps will only occur coinciding with a new major version release of the CUDA Toolkit.\n\nAnything not part of the public API may change at any time without warning.\n\n#### API Versioning\n\nThe public API of all CCCL's components shares a unified semantic version of `MAJOR.MINOR.PATCH`.\n\nOnly the most recently released version is supported.\nAs a rule, features and bug fixes are not backported to previously released versions or branches.\n\nThe preferred method for querying the version is to use `CCCL_[MAJOR\u002FMINOR\u002FPATCH_]VERSION` as described below.\nFor backwards compatibility, the Thrust\u002FCUB\u002Flibcudacxx version definitions are available and will always be consistent with `CCCL_VERSION`.\nNote that Thrust\u002FCUB use a `MMMmmmpp` scheme whereas CCCL and libcudacxx use `MMMmmmppp`.\n\n|                        | CCCL                                   | libcudacxx                                | Thrust                       | CUB                       
|\n|------------------------|----------------------------------------|-------------------------------------------|------------------------------|---------------------------|\n| Header                 | `\u003Ccuda\u002Fversion>`                       | `\u003Ccuda\u002Fstd\u002Fversion>`                      | `\u003Cthrust\u002Fversion.h>`         | `\u003Ccub\u002Fversion.h>`         |\n| Major Version          | `CCCL_MAJOR_VERSION`                   | `_LIBCUDACXX_CUDA_API_VERSION_MAJOR`      | `THRUST_MAJOR_VERSION`       | `CUB_MAJOR_VERSION`       |\n| Minor Version          | `CCCL_MINOR_VERSION`                   | `_LIBCUDACXX_CUDA_API_VERSION_MINOR`      | `THRUST_MINOR_VERSION`       | `CUB_MINOR_VERSION`       |\n| Patch\u002FSubminor Version | `CCCL_PATCH_VERSION`                   | `_LIBCUDACXX_CUDA_API_VERSION_PATCH`      | `THRUST_SUBMINOR_VERSION`    | `CUB_SUBMINOR_VERSION`    |\n| Concatenated Version   | `CCCL_VERSION (MMMmmmppp)`             | `_LIBCUDACXX_CUDA_API_VERSION (MMMmmmppp)`| `THRUST_VERSION (MMMmmmpp)`  | `CUB_VERSION (MMMmmmpp)`  |\n\n### Application Binary Interface (ABI)\n\nThe Application Binary Interface (ABI) is a set of rules for:\n- How a library's components are represented in machine code\n- How those components interact across different translation units\n\nA library's ABI includes, but is not limited to:\n- The mangled names of functions and types\n- The size and alignment of objects and types\n- The semantics of the bytes in the binary representation of an object\n\nAn **ABI Breaking Change** is any change that results in a change to the ABI of a function or type in the public API.\nFor example, adding a new data member to a struct is an ABI Breaking Change as it changes the size of the type.\n\nIn CCCL, the guarantees about ABI are as follows:\n\n- Symbols in the `thrust::` and `cub::` namespaces may break ABI at any time without warning.\n- The ABI of `thrust::` and `cub::` [symbols includes the CUDA architectures 
used for compilation](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fcub\u002Fdeveloper_overview.html#symbols-visibility). Therefore, a `thrust::` or `cub::` symbol may have a different ABI if:\n    - compiled with different architectures\n    - compiled as a CUDA source file (`-x cu`) vs C++ source (`-x cpp`)\n- Symbols in the `cuda::` namespace may also break ABI at any time. However, `cuda::` symbols embed an ABI version number that is incremented whenever an ABI break occurs. Multiple ABI versions may be supported concurrently, and therefore users have the option to revert to a prior ABI version. For more information, see [here](libcudacxx\u002Fdocs\u002Freleases\u002Fversioning.md).\n\n**Who should care about ABI?**\n\nIn general, CCCL users only need to worry about ABI issues when building or using a binary artifact (like a shared library) whose API directly or indirectly includes types provided by CCCL.\n\nFor example, consider if `libA.so` was built using CCCL version `X` and its public API includes a function like:\n```c++\nvoid foo(cuda::std::optional\u003Cint>);\n```\n\nIf another library, `libB.so`, is compiled using CCCL version `Y` and uses `foo` from `libA.so`, then this can fail if there was an ABI break between version `X` and `Y`.\nUnlike with API breaking changes, ABI breaks usually do not require code changes and only require recompiling everything to use the same ABI version.\n\nTo learn more about ABI and why it is important, see [What is ABI, and What Should C++ Do About It?](https:\u002F\u002Fwg21.link\u002FP2028R0).\n\n### Compatibility Guidelines\n\nAs mentioned above, not all possible source breaking changes constitute a Breaking Change that would require incrementing CCCL's API major version number.\n\nUsers are encouraged to adhere to the following guidelines in order to minimize the risk of disruptions from accidentally depending on parts of CCCL that are not part of the public API:\n\n- Do not add any declarations to, or specialize 
any template from, the `thrust::`, `cub::`, `nv::`, or `cuda::` namespaces unless an exception is noted for a specific symbol, e.g., specializing `cuda::std::iterator_traits`\n    - **Rationale**: This would cause conflicts if a symbol or specialization is added with the same name.\n- Do not take the address of any API in the `thrust::`, `cub::`, `cuda::`, or `nv::` namespaces.\n    - **Rationale**: This would prevent adding overloads of these APIs.\n- Do not forward declare any API in the `thrust::`, `cub::`, `cuda::`, or `nv::` namespaces.\n    - **Rationale**: This would prevent adding overloads of these APIs.\n- Do not directly reference any symbol prefixed with `_`, `__`, or with `detail` anywhere in its name including a `detail::` namespace or macro\n     - **Rationale**: These symbols are for internal use only and may change at any time without warning.\n- Include what you use. For every CCCL symbol that you use, directly `#include` the header file that declares that symbol. In other words, do not rely on headers implicitly included by other headers.\n     - **Rationale**: Internal includes may change at any time.\n\nPortions of this section were inspired by [Abseil's Compatibility Guidelines](https:\u002F\u002Fabseil.io\u002Fabout\u002Fcompatibility).\n\n## Deprecation Policy\n\nWe will do our best to notify users prior to making any breaking changes to the public API, ABI, or modifying the supported platforms and compilers.\n\nAs appropriate, deprecations will come in the form of programmatic warnings which can be disabled.\n\nThe deprecation period will depend on the impact of the change, but will usually last at least 2 minor version releases.\n\n\n## Mapping to CTK Versions\n\n| CCCL version | CTK version |\n|--------------|-------------|\n| 3.2          | 13.2        |\n| 3.1          | 13.1        |\n| 3.0          | 13.0        |\n| 2.8          | 12.9        |\n| 2.7          | 12.8        |\n| 2.5          | 12.6        |\n| 2.4          | 12.5     
   |\n| 2.3          | 12.4        |\n\nTest yourself: https:\u002F\u002Fcuda.godbolt.org\u002Fz\u002FK818M4Y9f\n\nCTKs before 12.4 shipped Thrust, CUB and libcudacxx as individual libraries.\n\n| Thrust\u002FCUB\u002Flibcudacxx version | CTK version |\n|-------------------------------|-------------|\n| 2.2                           | 12.3        |\n| 2.1                           | 12.2        |\n| 2.0\u002F2.0\u002F1.9                   | 12.1        |\n| 2.0\u002F2.0\u002F1.9                   | 12.0        |\n\n\n## CI Pipeline Overview\n\nFor a detailed overview of the CI pipeline, see [ci-overview.md](ci-overview.md).\n\n## Related Projects\n\nProjects that are related to CCCL's mission to make CUDA more delightful:\n- [cuCollections](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FcuCollections) - GPU accelerated data structures like hash tables\n- [NVBench](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fnvbench) - Benchmarking library tailored for CUDA applications\n- [stdexec](https:\u002F\u002Fgithub.com\u002Fnvidia\u002Fstdexec) - Reference implementation for Senders asynchronous programming model\n\n## Projects Using CCCL\n\nDoes your project use CCCL? 
[Open a PR to add your project to this list!](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fedit\u002Fmain\u002FREADME.md)\n\n- [AmgX](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FAMGX) - Multi-grid linear solver library\n- [ColossalAI](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI) - Tools for writing distributed deep learning models\n- [cuDF](https:\u002F\u002Fgithub.com\u002Frapidsai\u002Fcudf) - Algorithms and file readers for ETL data analytics\n- [cuGraph](https:\u002F\u002Fgithub.com\u002Frapidsai\u002Fcugraph) - Algorithms for graph analytics\n- [cuML](https:\u002F\u002Fgithub.com\u002Frapidsai\u002Fcuml) - Machine learning algorithms and primitives\n- [cuOpt](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcuopt) - Accelerated decision optimization\n- [CuPy](https:\u002F\u002Fcupy.dev) - NumPy & SciPy for GPU\n- [cuSOLVER](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcusolver) - Dense and sparse linear solvers\n- [CUSP](https:\u002F\u002Fgithub.com\u002Fcusplibrary\u002Fcusplibrary) - Sparse matrix operations, iterative methods, and algebraic multigrid\n- [cuVS](https:\u002F\u002Fgithub.com\u002Frapidsai\u002Fcuvs) - Approximate clustering and vector search\n- [GooFit](https:\u002F\u002Fgithub.com\u002FGooFit\u002FGooFit) - Library for maximum-likelihood fits\n- [HeavyDB](https:\u002F\u002Fgithub.com\u002Fheavyai\u002Fheavydb) - SQL database engine\n- [HOOMD](https:\u002F\u002Fgithub.com\u002Fglotzerlab\u002Fhoomd-blue) - Monte Carlo and molecular dynamics simulations\n- [HugeCTR](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR) - GPU-accelerated recommender framework\n- [Hydra](https:\u002F\u002Fgithub.com\u002FMultithreadCorner\u002FHydra) - High-energy Physics Data Analysis\n- [Hypre](https:\u002F\u002Fgithub.com\u002Fhypre-space\u002Fhypre) - Multigrid linear solvers\n- [LightSeq](https:\u002F\u002Fgithub.com\u002Fbytedance\u002Flightseq) - Training and inference for sequence processing and generation\n- 
[MatX](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fmatx) - Numerical computing library using expression templates to provide efficient, Python-like syntax\n- [PyTorch](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch) - Tensor and neural network computations\n- [Qiskit](https:\u002F\u002Fgithub.com\u002FQiskit\u002Fqiskit-aer) - High performance simulator for quantum circuits\n- [QUDA](https:\u002F\u002Fgithub.com\u002Flattice\u002Fquda) - Lattice quantum chromodynamics (QCD) computations\n- [RAFT](https:\u002F\u002Fgithub.com\u002Frapidsai\u002Fraft) - Algorithms and primitives for machine learning\n- [SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang) - LLM serving framework\n- [TensorFlow](https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftensorflow) - End-to-end platform for machine learning\n- [TensorRT](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT) - Deep learning inference\n- [TensorRT-LLM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-LLM) - Optimized LLM inference\n- [tsne-cuda](https:\u002F\u002Fgithub.com\u002FCannyLab\u002Ftsne-cuda) - Stochastic Neighborhood Embedding library\n- [Visualization Toolkit (VTK)](https:\u002F\u002Fgitlab.kitware.com\u002Fvtk\u002Fvtk) - Rendering and visualization library\n- [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) - LLM inference and serving\n- [XGBoost](https:\u002F\u002Fgithub.com\u002Fdmlc\u002Fxgboost) - Gradient boosting machine learning algorithms\n","[![Open in GitHub Codespaces](https:\u002F\u002Fgithub.com\u002Fcodespaces\u002Fbadge.svg)](https:\u002F\u002Fcodespaces.new\u002FNVIDIA\u002Fcccl?quickstart=1&devcontainer_path=.devcontainer%2Fdevcontainer.json)\n\n|[贡献者指南 (Contributor Guide)](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fblob\u002Fmain\u002FCONTRIBUTING.md)|[开发容器 (Dev 
Containers)](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fblob\u002Fmain\u002F.devcontainer\u002FREADME.md)|[Discord](https:\u002F\u002Fdiscord.gg\u002Fnvidiadeveloper)|[Godbolt](https:\u002F\u002Fgodbolt.org\u002Fz\u002Fx4G73af9a)|[GitHub 项目 (GitHub Project)](https:\u002F\u002Fgithub.com\u002Forgs\u002FNVIDIA\u002Fprojects\u002F6)|[文档 (Documentation)](https:\u002F\u002Fnvidia.github.io\u002Fcccl)|\n|-|-|-|-|-|-|\n\n# CUDA 核心计算库（CUDA Core Compute Libraries, CCCL）\n\n欢迎来到 CUDA 核心计算库（CCCL），我们的使命是让 CUDA 开发更加愉悦。\n\n本仓库将三个关键的 CUDA C++ 库统一到一个便捷的代码仓库中：\n\n- [Thrust](thrust)（[原仓库](https:\u002F\u002Fgithub.com\u002Fnvidia\u002Fthrust)）\n- [CUB](cub)（[原仓库](https:\u002F\u002Fgithub.com\u002Fnvidia\u002Fcub)）\n- [libcudacxx](libcudacxx)（[原仓库](https:\u002F\u002Fgithub.com\u002Fnvidia\u002Flibcudacxx)）\n\nCCCL 的目标是为 CUDA C++ 开发者提供构建模块，使其更容易编写安全且高效的代码。  \n将这些库整合在一起可简化您的开发流程，并增强您利用 CUDA C++ 强大能力的灵活性。  \n有关统一这些项目的决策详情，请参阅[此处公告](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fdiscussions\u002F520)。\n\n## 概述\n\nCUDA 核心计算库（CCCL）的概念源于多年来独立发展的 Thrust、CUB 和 libcudacxx 项目，它们有着相似的目标：为 CUDA 开发者提供高质量、高性能且易于使用的 C++ 抽象。  \n自然地，这三个项目之间存在大量重叠，因此很明显，将它们统一到一个仓库中将更好地服务社区。\n\n- **Thrust** 是一个 C++ 并行算法库，它启发了 C++ 标准库引入并行算法。Thrust 的高层接口极大地提升了程序员的生产力，同时通过可配置的后端（支持多种并行编程框架，如 CUDA、TBB 和 OpenMP）实现了在 GPU 与多核 CPU 之间的性能可移植性。\n\n- **CUB** 是一个底层的、专为 CUDA 设计的库，旨在为所有 GPU 架构提供极致性能的并行算法。除了设备级（device-wide）算法外，它还提供了*协作式算法*（cooperative algorithms），例如块级（block-wide）归约和线程束级（warp-wide）扫描，为 CUDA 内核开发者提供构建极致性能自定义内核所需的构建模块。\n\n- **libcudacxx** 是 CUDA C++ 标准库。它提供了可在主机（host）和设备（device）代码中使用的 C++ 标准库实现。此外，它还提供了针对 CUDA 特定硬件功能的抽象，例如同步原语（synchronization primitives）、缓存控制（cache control）、原子操作（atomics）等。\n\nCCCL 的主要目标是扮演类似于标准 C++ 库在标准 C++ 中的角色：为 CUDA C++ 开发者提供通用、极致性能的工具，使其能够专注于解决真正重要的问题。  \n统一这些项目是实现这一目标的第一步。\n\n## 示例\n\n以下是一个简单示例，演示了如何使用来自 Thrust、CUB 和 libcudacxx 的 CCCL 功能。\n\n该示例展示了如何使用 Thrust\u002FCUB\u002Flibcudacxx 实现一个简单的并行归约内核。  \n每个线程块使用 `cub::BlockReduce` 计算数组子集的和。  \n然后，每个块的和通过 libcudacxx 中的 
`cuda::atomic_ref` 使用原子加法归约为单个值。\n\n接着，示例展示了如何使用 Thrust 的 `reduce` 算法完成相同的归约操作，并比较结果。\n\n[立即在 Godbolt 上尝试！](https:\u002F\u002Fgodbolt.org\u002Fz\u002F3KaWz3Msf)\n\n```cpp\n#include \u003Cthrust\u002Fexecution_policy.h>\n#include \u003Cthrust\u002Fdevice_vector.h>\n#include \u003Ccub\u002Fblock\u002Fblock_reduce.cuh>\n#include \u003Ccuda\u002Fatomic>\n#include \u003Ccuda\u002Fcmath>\n#include \u003Ccuda\u002Fstd\u002Fspan>\n#include \u003Ccassert>\n#include \u003Ccstdio>\n#include \u003Ciostream>\n\ntemplate \u003Cint block_size>\n__global__ void reduce(cuda::std::span\u003Cint const> data, cuda::std::span\u003Cint> result) {\n  using BlockReduce = cub::BlockReduce\u003Cint, block_size>;\n  __shared__ typename BlockReduce::TempStorage temp_storage;\n\n  int const index = threadIdx.x + blockIdx.x * blockDim.x;\n  int sum = 0;\n  if (index \u003C data.size()) {\n    sum += data[index];\n  }\n  sum = BlockReduce(temp_storage).Sum(sum);\n\n  if (threadIdx.x == 0) {\n    cuda::atomic_ref\u003Cint, cuda::thread_scope_device> atomic_result(result.front());\n    atomic_result.fetch_add(sum, cuda::memory_order_relaxed);\n  }\n}\n\nint main() {\n\n  \u002F\u002F 分配并初始化输入数据\n  int const N = 1000;\n  thrust::device_vector\u003Cint> data(N);\n  thrust::fill(data.begin(), data.end(), 1);\n\n  \u002F\u002F 分配输出数据\n  thrust::device_vector\u003Cint> kernel_result(1);\n\n  \u002F\u002F 使用自定义内核计算 `data` 的归约和\n  constexpr int block_size = 256;\n  int const num_blocks = cuda::ceil_div(N, block_size);\n  reduce\u003Cblock_size>\u003C\u003C\u003Cnum_blocks, block_size>>>(cuda::std::span\u003Cint const>(thrust::raw_pointer_cast(data.data()), data.size()),\n                                                 cuda::std::span\u003Cint>(thrust::raw_pointer_cast(kernel_result.data()), 1));\n\n  auto const err = cudaDeviceSynchronize();\n  if (err != cudaSuccess) {\n    std::cout \u003C\u003C \"Error: \" \u003C\u003C cudaGetErrorString(err) \u003C\u003C std::endl;\n    return -1;\n  }\n\n  int const custom_result = kernel_result[0];\n\n  
\u002F\u002F 使用 Thrust 计算相同的归约和\n  int const thrust_result = thrust::reduce(thrust::device, data.begin(), data.end(), 0);\n\n  \u002F\u002F 确保两种方法的结果一致\n  std::printf(\"Custom kernel sum: %d\\n\", custom_result);\n  std::printf(\"Thrust reduce sum: %d\\n\", thrust_result);\n  assert(kernel_result[0] == thrust_result);\n  return 0;\n}\n```\n\n## 快速开始\n\n### 用户\n\nCCCL 中的所有内容均为头文件（header-only）形式。  \n因此，用户只需关注如何获取头文件以及如何将其集成到自己的构建系统中即可。\n\n#### CUDA Toolkit\n使用 CCCL 最简单的方式是通过 [CUDA Toolkit](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-toolkit)，其中已包含 CCCL 的头文件。  \n当你使用 `nvcc` 编译时，它会自动将 CCCL 头文件路径添加到包含路径中，因此你可以直接在代码中 `#include` 任意 CCCL 头文件，无需额外配置。\n\n如果使用其他编译器进行编译，则需要更新构建系统的包含路径（include search path），使其指向 CUDA Toolkit 安装目录中的 CCCL 头文件（例如 `\u002Fusr\u002Flocal\u002Fcuda\u002Finclude`）。\n\n```cpp\n#include \u003Cthrust\u002Fdevice_vector.h>\n#include \u003Ccub\u002Fcub.cuh>\n#include \u003Ccuda\u002Fstd\u002Fatomic>\n```\n\n#### GitHub\n\n希望使用最新版 CCCL 的用户建议直接从 GitHub 获取。  \n**注意**：使用较新版本的 CCCL 配合较旧版本的 CUDA Toolkit 是支持的，但反过来则不支持。  \n有关 CCCL 与 CUDA Toolkit 兼容性的完整信息，请参阅 [平台支持](#平台支持)。\n\n由于 CCCL 全部为头文件形式，因此克隆并将其包含到一个简单项目中非常容易：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl.git\nnvcc -Icccl\u002Fthrust -Icccl\u002Flibcudacxx\u002Finclude -Icccl\u002Fcub main.cu -o main\n```\n\n> **注意**  \n> 请使用 `-I` 而非 `-isystem`，以避免与 `nvcc` 从 CUDA Toolkit 中隐式包含的 CCCL 头文件发生冲突。所有 CCCL 头文件均使用 `#pragma system_header`，确保即使使用 `-I` 也能像使用 `-isystem` 一样屏蔽警告。更多信息请参见 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fissues\u002F527。\n\n##### 安装\n\n默认的 CMake 选项仅生成安装规则，因此熟悉的 `cmake . && make install` 工作流可直接使用：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl.git\ncd cccl\ncmake . 
-DCMAKE_INSTALL_PREFIX=\u002Fusr\u002Flocal\nmake install\n```\n\n我们还提供了一个便捷脚本：\n\n```bash\nci\u002Finstall_cccl.sh \u002Fusr\u002Flocal\n```\n\n###### 使用预设（presets）进行高级安装\n\nCMake 预设（presets）也已提供，并支持包含实验性库的选项：\n\n```bash\ncmake --preset install -DCMAKE_INSTALL_PREFIX=\u002Fusr\u002Flocal\ncmake --build --preset install --target install\n```\n\n使用 `install-unstable` 预设可包含实验性库，或使用 `install-unstable-only` 预设仅安装实验性库。\n\n#### Conda\n\nCCCL 还通过 `conda-forge` 频道为每个版本提供了 conda 包：\n\n```bash\nconda config --add channels conda-forge\nconda install cccl\n```\n\n这会将最新版 CCCL 安装到 conda 环境的 `$CONDA_PREFIX\u002Finclude\u002F` 和 `$CONDA_PREFIX\u002Flib\u002Fcmake\u002F` 目录中。  \n该安装可通过 CMake 的 `find_package(CCCL)` 自动发现，并可在 conda 环境中的任意编译器中使用。  \n更多信息请参阅 [conda-forge 入门指南](https:\u002F\u002Fconda-forge.org\u002Fdocs\u002Fuser\u002Fintroduction\u002F)。\n\n如果你想使用与特定 CUDA Toolkit（例如 CUDA 12.4）配套发布的 CCCL 版本，可以运行以下命令安装：\n\n```bash\nconda config --add channels conda-forge\nconda install cuda-cccl cuda-version=12.4\n```\n\n`cuda-cccl` 元包（metapackage）会安装与 `cuda-version` 对应的 CUDA Toolkit 中配套发布的 `cccl` 版本。  \n如果你在安装 `cuda-cccl` 后希望升级到最新的 `cccl`，请先卸载 `cuda-cccl`，再安装 `cccl`：\n\n```bash\nconda uninstall cuda-cccl\nconda install -c conda-forge cccl\n```\n\n> **注意**  \n> 还存在名为 `cuda-cccl_linux-64` 等的 conda 包。  \n> 这些包包含作为 CUDA Toolkit 一部分发布的 CCCL 版本，但专为 CUDA Toolkit 内部使用而设计。  \n> 为了与 conda 编译器兼容，请安装 `cccl` 或 `cuda-cccl`。  \n> 更多信息请参阅 [cccl conda-forge recipe](https:\u002F\u002Fgithub.com\u002Fconda-forge\u002Fcccl-feedstock\u002Fblob\u002Fmain\u002Frecipe\u002Fmeta.yaml)。\n\n##### CMake 集成\n\nCCCL 使用 [CMake](https:\u002F\u002Fcmake.org\u002F) 构建和安装基础设施，包括测试以及供其他 CMake 项目链接的目标。  \n因此，推荐使用 CMake 将 CCCL 集成到其他项目中。\n\n关于如何通过 CMake Package Manager 实现集成的完整示例，请参阅 [我们的基础示例项目](examples\u002Fbasic)。\n\n其他构建系统理论上也可工作，但只有 CMake 经过测试。  \n欢迎提交改进 CCCL 在其他构建系统中集成体验的贡献。\n\n### 贡献者\n\n有兴趣为 CCCL 的改进做出贡献？请查阅我们的 [贡献指南](CONTRIBUTING.md)，其中全面介绍了设置开发环境、修改代码、运行测试和提交 PR 所需了解的一切内容。\n\n## 平台支持\n\n**目标**：本节描述用户可以在哪些平台上预期 CCCL 
能成功编译和运行。\n\n通常情况下，CCCL 应能在 CUDA Toolkit 支持的所有平台上正常工作，但细节决定成败。  \n以下各小节详细说明了不同版本的 CUDA Toolkit、主机编译器（host compilers）和 C++ 方言（C++ dialects）的支持与测试情况。\n\n### CUDA Toolkit (CTK) 兼容性\n\n**摘要：**\n- 最新版本的 CCCL（CUDA C++ Core Libraries）向后兼容当前及上一个 CTK 主版本系列\n- CCCL 从不向前兼容任何版本的 CTK。请始终使用与您的 CTK 所附带版本相同或更新的 CCCL 版本\n- CCCL 的次版本升级不会破坏现有代码，但新功能可能不支持所有 CTK 版本\n\n我们鼓励 CCCL 用户充分利用最新改进，并通过始终使用最新版 CCCL 来 [\"live at head\"](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=tISy7EJQPzI)（即始终使用主干最新代码）。\n为了获得无缝体验，您可以独立于整个 CUDA Toolkit 升级 CCCL。\n这是可行的，因为 CCCL 对当前和上一个主版本系列中每个 CTK 次版本的最新补丁版本都保持向后兼容。\n在某些特殊情况下，所支持的 CUDA Toolkit 发布版本的最低次版本可能需要比其主版本系列中最旧的发布版本更新。\n\n当新的 CTK 主版本发布时，我们将停止对最旧受支持主版本的支持。\n\n| CCCL 版本 | 支持的 CUDA Toolkit 版本                  |\n|-----------|------------------------------------------|\n| 2.x       | 11.1 - 11.8, 12.x（仅最新补丁版本）      |\n| 3.x       | 12.x, 13.x（仅最新补丁版本）             |\n\n使用最新 CCCL 编写的[行为良好的代码](#compatibility-guidelines)应能在任何受支持的 CTK 版本上成功编译和运行。\n对于依赖新 CTK 功能的新特性，可能会出现例外情况，因此这些特性在较旧版本的 CTK 上将无法工作。\n\n用户可以将新版 CCCL 集成到旧版 CTK 中，但不能反向操作。\n这意味着旧版 CCCL 与新版 CTK 不兼容。\n换句话说，**CCCL 从不向前兼容 CUDA Toolkit。**\n\n下表总结了 CTK 与 CCCL 的兼容性：\n\n| CTK 版本   | 内置 CCCL 版本        |    目标 CCCL 版本     | 是否支持？ |                           说明                           |\n|:----------:|:---------------------:|:--------------------:|:----------:|:--------------------------------------------------------:|\n|  CTK `X.Y` |  CCCL `MAJOR.MINOR`   | CCCL `MAJOR.MINOR+n` |    ✅     |            部分新功能可能无法使用                        |\n|  CTK `X.Y` |  CCCL `MAJOR.MINOR`   | CCCL `MAJOR+1.MINOR` |    ✅     | 可能存在破坏性变更；部分新功能可能不可用                 |\n|  CTK `X.Y` |  CCCL `MAJOR.MINOR`   | CCCL `MAJOR+2.MINOR` |    ❌     |    CCCL 仅支持两个 CTK 主版本                            |\n|  CTK `X.Y` |  CCCL `MAJOR.MINOR`   | CCCL `MAJOR.MINOR-n` |    ❌     |          CCCL 不具备向前兼容性                           |\n|  CTK `X.Y` |  CCCL `MAJOR.MINOR`   | CCCL `MAJOR-n.MINOR` |    ❌     |          CCCL 不具备向前兼容性                      
     |\n\n有关 CCCL 版本控制、API\u002FABI 兼容性以及破坏性变更的更多信息，请参阅下方的 [版本控制](#versioning) 章节。\n\n### 操作系统\n\n除非另有说明，CCCL 支持与 CUDA Toolkit 相同的所有操作系统，具体文档如下：\n - [Linux](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcuda-installation-guide-linux\u002Findex.html#system-requirements)\n - [Windows](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcuda-installation-guide-microsoft-windows\u002Findex.html#system-requirements)\n\n### 主机编译器（Host Compilers）\n\n除非另有说明，CCCL 支持与最新 CUDA Toolkit 相同的主机编译器，具体文档如下：\n- [Linux](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcuda-installation-guide-linux\u002Findex.html#host-compiler-support-policy)\n- [Windows](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcuda-installation-guide-microsoft-windows\u002Findex.html#system-requirements)\n\n对于 Linux 上的 GCC，至少需要 7.x 版本。\n\n使用较旧 CUDA Toolkit 时，我们也仅支持最新 CUDA Toolkit 的主机编译器，\n但至少会支持任何受支持的旧版 CUDA Toolkit 中最新的主机编译器。\n\n我们可能会保留对额外编译器的支持，并接受社区提供的合理修复补丁。\n但我们不会投入大量时间对旧编译器的问题进行排查或修复。\n\n本着“只支持经过测试的内容”的原则，有关我们确切测试内容的更多信息，请参阅我们的 [CI 概览](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fblob\u002Fmain\u002Fci-overview.md)。\n\n### GPU 架构\n\nCCCL 支持所有 [*当前* 主版本 CUDA Toolkit (CTK) 所支持的 GPU 架构](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-gpus)。\n\n明确说明：尽管 CCCL 可以使用当前和上一个 CTK 主版本进行编译，但我们不会测试或验证仅在旧版 CTK 中支持的架构。\n\n这些架构可能仍然可以工作——我们不会故意破坏它们——但它们不在我们的常规 CI 覆盖范围内。此外，新功能也无法保证在这些架构上正常工作。\n\n我们欢迎社区贡献合理的修复方案，以帮助这些旧架构上的用户解决问题。\n\n例如，CCCL 3.0 支持使用 CTK 12.x 和 13.x 进行编译，其中：\n- CUDA Toolkit 13.x 支持 `>=sm_75`\n- CUDA Toolkit 12.x 支持 `>=sm_50`\n\n在此场景下，使用 CTK 12.x 编译 CCCL 3.0 并针对低于 `sm_75` 的架构可能是可行的，但这些配置不属于我们的常规测试范围。\n\n### C++ 方言（Dialects）\n- C++17\n- C++20\n\n### 测试策略\n\nCCCL 的测试策略在尽可能覆盖更多配置与保持合理的 CI 时间之间取得平衡。\n\n对于 CUDA Toolkit 版本，测试会同时针对最旧和最新的受支持版本进行。\n例如，如果 CUDA Toolkit 的最新版本是 12.6，则测试会针对 11.1 和 12.6 进行。\n对于每个 CUDA 版本，会使用所有受支持的主机编译器和所有受支持的 C++ 方言完成构建。\n\n测试策略和矩阵在不断演进。\n[`ci\u002Fmatrix.yaml`](ci\u002Fmatrix.yaml) 文件中定义的矩阵是权威信息来源。\n有关我们 CI 流水线的更多信息，请参阅 [此处](ci-overview.md)。\n\n## 版本管理（Versioning）\n\n**目标：** 
本节描述了 CCCL 的版本管理方式、API\u002FABI 稳定性保证以及兼容性指南，以尽量减少升级时的麻烦。\n\n**摘要**\n- CCCL 的整个 API 在所有组件中共享一个统一的语义化版本（semantic version）\n- 仅最新发布的版本受支持，修复不会向后移植到之前的版本\n- API 的破坏性变更（breaking changes）和 CCCL 主版本号（major version）的递增，仅会在 CUDA Toolkit 发布新主版本时同步进行\n- 并非所有源代码层面的破坏性变更都被视为公共 API 的破坏性变更，从而触发主版本号的递增\n- 不要依赖 `cub::` 或 `thrust::` 命名空间中实体的 ABI（Application Binary Interface，应用程序二进制接口）稳定性\n- `cuda::` 命名空间中的符号可能发生 ABI 破坏性变更，但会通过递增嵌入在 `cuda::` 符号内联命名空间（inline namespace）中的 ABI 版本来体现。多个 ABI 版本可能被同时支持。\n\n**注意：** 在将 Thrust、CUB 和 libcudacxx 合并到本仓库之前，每个库都根据语义化版本规范独立进行版本管理。  \n从 2.1 版本开始，这三个库在其各自的仓库中同步了发布版本号。  \n今后，CCCL 将继续使用单一的 [语义化版本](https:\u002F\u002Fsemver.org\u002F) 进行发布，其中 2.2.0 是首个来自 [nvidia\u002Fcccl](https:\u002F\u002Fgithub.com\u002Fnvidia\u002Fcccl) 仓库的版本。\n\n### 破坏性变更（Breaking Change）\n\n破坏性变更是指在已发布版本之间，对**明确支持**的功能所做的更改，该更改要求用户必须进行额外工作才能升级到新版本。\n\n从极限角度看，[_任何_更改](https:\u002F\u002Fwww.hyrumslaw.com\u002F) 都有可能在某处破坏某些用户的使用。  \n因此，并非所有可能导致源代码不兼容的更改都被视为公共 API 的破坏性变更，从而触发主版本号的递增。\n\n以下各节详细描述了 CCCL API 和 ABI 的破坏性变更细节。\n\n### 应用程序编程接口（Application Programming Interface, API）\n\nCCCL 的公共 API 是指为提供库功能而**有意暴露**给用户的全部功能。\n\n换句话说，CCCL 的公共 API 不仅限于函数签名，还包括（但不限于）：\n- 用户代码中可直接包含的头文件的位置和名称\n- 用户代码中可直接使用的命名空间\n- 位于头文件中、供用户代码直接使用的函数、类和变量的声明和\u002F或定义\n- 供用户代码直接使用的函数、类和变量的语义\n\n此外，CCCL 的公共 API **不包括**以下内容：\n- 任何以下划线 `_` 或双下划线 `__` 开头的符号\n- 名称中包含 `detail` 的任何符号，包括 `detail::` 命名空间或宏\n- 位于 `detail\u002F` 目录或其子目录中的任何头文件\n- 公共 API 中任何头文件隐式包含的头文件\n\n总体而言，我们的目标是避免对公共 API 做出破坏性更改。  \n只有当此类更改能为用户提供更好的性能、更易理解的 API 和\u002F或更一致的 API 时，才会进行。\n\n任何对公共 API 的破坏性变更都将导致 CCCL 主版本号的递增。  \n根据 [CUDA 次版本兼容性（CUDA Minor Version Compatibility）](https:\u002F\u002Fdocs.nvidia.com\u002Fdeploy\u002Fcuda-compatibility\u002F#minor-version-compatibility) 的原则，  \nAPI 破坏性变更和 CCCL 主版本号的递增，仅会在 CUDA Toolkit 发布新主版本时同步发生。\n\n不属于公共 API 的任何内容都可能随时更改，且不另行通知。\n\n#### API 版本管理\n\nCCCL 所有组件的公共 API 共享统一的语义化版本号 `MAJOR.MINOR.PATCH`。\n\n仅最新发布的版本受支持。  \n原则上，新功能和错误修复不会向后移植到先前发布的版本或分支中。\n\n推荐的版本查询方式是使用如下所述的 `CCCL_[MAJOR\u002FMINOR\u002FPATCH_]VERSION` 宏。  
\n为了向后兼容，Thrust\u002FCUB\u002Flibcudacxx 的版本定义仍然可用，并且始终与 `CCCL_VERSION` 保持一致。  \n请注意，Thrust\u002FCUB 使用 `MMMmmmpp` 格式，而 CCCL 和 libcudacxx 使用 `MMMmmmppp` 格式。\n\n|                        | CCCL                                   | libcudacxx                                | Thrust                       | CUB                       |\n|------------------------|----------------------------------------|-------------------------------------------|------------------------------|---------------------------|\n| 头文件                 | `\u003Ccuda\u002Fversion>`                       | `\u003Ccuda\u002Fstd\u002Fversion>`                      | `\u003Cthrust\u002Fversion.h>`         | `\u003Ccub\u002Fversion.h>`         |\n| 主版本号               | `CCCL_MAJOR_VERSION`                   | `_LIBCUDACXX_CUDA_API_VERSION_MAJOR`      | `THRUST_MAJOR_VERSION`       | `CUB_MAJOR_VERSION`       |\n| 次版本号               | `CCCL_MINOR_VERSION`                   | `_LIBCUDACXX_CUDA_API_VERSION_MINOR`      | `THRUST_MINOR_VERSION`       | `CUB_MINOR_VERSION`       |\n| 补丁\u002F次次版本号        | `CCCL_PATCH_VERSION`                   | `_LIBCUDACXX_CUDA_API_VERSION_PATCH`      | `THRUST_SUBMINOR_VERSION`    | `CUB_SUBMINOR_VERSION`    |\n| 拼接版本号             | `CCCL_VERSION (MMMmmmppp)`             | `_LIBCUDACXX_CUDA_API_VERSION (MMMmmmppp)`| `THRUST_VERSION (MMMmmmpp)`  | `CUB_VERSION (MMMmmmpp)`  |\n\n### 应用程序二进制接口（ABI, Application Binary Interface）\n\n应用程序二进制接口（ABI）是一组规则，用于规定：\n- 库的组件在机器码中如何表示\n- 这些组件如何在不同的翻译单元（translation units）之间交互\n\n一个库的 ABI 包括但不限于以下内容：\n- 函数和类型的名称修饰（mangled names）\n- 对象和类型的大小与对齐方式（size and alignment）\n- 对象二进制表示中字节的语义（semantics）\n\n**ABI 破坏性变更（ABI Breaking Change）** 是指任何导致公共 API 中函数或类型的 ABI 发生变化的修改。  \n例如，向结构体（struct）中添加一个新的数据成员就是一种 ABI 破坏性变更，因为它改变了该类型的大小。\n\n在 CCCL 中，关于 ABI 的保证如下：\n\n- `thrust::` 和 `cub::` 命名空间中的符号可能在任何时候无预警地破坏 ABI。\n- `thrust::` 和 `cub::` [符号的 ABI 包含了编译时所用的 CUDA 
架构](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fcub\u002Fdeveloper_overview.html#symbols-visibility)。因此，如果满足以下任一条件，`thrust::` 或 `cub::` 符号可能会有不同的 ABI：\n    - 使用不同的架构进行编译\n    - 作为 CUDA 源文件（`-x cu`）而非 C++ 源文件（`-x cpp`）进行编译\n- `cuda::` 命名空间中的符号也可能随时破坏 ABI。然而，`cuda::` 符号内嵌了一个 ABI 版本号，每当发生 ABI 破坏时该版本号会递增。多个 ABI 版本可以同时被支持，因此用户可以选择回退到之前的 ABI 版本。更多信息请参见 [此处](libcudacxx\u002Fdocs\u002Freleases\u002Fversioning.md)。\n\n**谁需要关心 ABI？**\n\n通常，只有当 CCCL 用户构建或使用一个二进制制品（如共享库），且其 API 直接或间接包含 CCCL 提供的类型时，才需要关注 ABI 问题。\n\n例如，假设 `libA.so` 是使用 CCCL 版本 `X` 构建的，其公共 API 包含如下函数：\n```c++\nvoid foo(cuda::std::optional\u003Cint>);\n```\n\n如果另一个库 `libB.so` 使用 CCCL 版本 `Y` 编译，并调用 `libA.so` 中的 `foo` 函数，那么当版本 `X` 与 `Y` 之间存在 ABI 破坏时，该调用可能会失败。  \n与 API 破坏性变更不同，ABI 破坏通常不需要修改代码，只需将所有内容重新编译为使用相同的 ABI 版本即可。\n\n若想进一步了解 ABI 及其重要性，请参阅 [《什么是 ABI，C++ 应该如何应对？》](https:\u002F\u002Fwg21.link\u002FP2028R0)。\n\n### 兼容性指南\n\n如上所述，并非所有可能导致源代码不兼容的变更都构成需要递增 CCCL API 主版本号的“破坏性变更”。\n\n为尽量减少因意外依赖 CCCL 非公共 API 部分而导致的问题，建议用户遵循以下准则：\n\n- 除非针对特定符号另有说明（例如特化 `cuda::std::iterator_traits`），否则不要向 `thrust::`、`cub::`、`nv::` 或 `cuda::` 命名空间中添加任何声明，也不要特化其中的任何模板  \n    - **理由**：如果将来添加了同名的符号或特化，会导致冲突。\n- 不要获取 `thrust::`、`cub::`、`cuda::` 或 `nv::` 命名空间中任何 API 的地址  \n    - **理由**：这会阻碍为这些 API 添加重载。\n- 不要对 `thrust::`、`cub::`、`cuda::` 或 `nv::` 命名空间中的任何 API 进行前向声明  \n    - **理由**：这会阻碍为这些 API 添加重载。\n- 不要直接引用任何以下划线 `_`、双下划线 `__` 开头，或名称中包含 `detail`（包括 `detail::` 命名空间或宏）的符号  \n    - **理由**：这些符号仅供内部使用，可能随时无预警地更改。\n- 使用什么就包含什么。对于你使用的每个 CCCL 符号，请直接 `#include` 声明该符号的头文件。换句话说，不要依赖其他头文件隐式包含的头文件。  \n    - **理由**：内部包含关系可能随时更改。\n\n本部分内容部分参考了 [Abseil 的兼容性指南](https:\u002F\u002Fabseil.io\u002Fabout\u002Fcompatibility)。\n\n## 弃用策略\n\n我们将尽最大努力在对公共 API、ABI 或支持的平台和编译器进行任何破坏性变更之前通知用户。\n\n在适当的情况下，弃用将以可禁用的程序化警告形式出现。\n\n弃用周期将根据变更的影响程度而定，但通常至少持续 2 个次要版本（minor version）发布。\n\n## 与 CTK 版本的对应关系\n\n| CCCL 版本 | CTK 版本 |\n|-----------|----------|\n| 3.2       | 13.2     |\n| 3.1       | 13.1     |\n| 3.0       | 13.0     |\n| 2.8       | 12.9     |\n| 2.7       | 12.8  
   |\n| 2.5       | 12.6     |\n| 2.4       | 12.5     |\n| 2.3       | 12.4     |\n\n自我测试：https:\u002F\u002Fcuda.godbolt.org\u002Fz\u002FK818M4Y9f\n\n12.4 之前的 CTK 将 Thrust、CUB 和 libcudacxx 作为独立库发布。\n\n| Thrust\u002FCUB\u002Flibcudacxx 版本 | CTK 版本 |\n|----------------------------|----------|\n| 2.2                        | 12.3     |\n| 2.1                        | 12.2     |\n| 2.0\u002F2.0\u002F1.9                | 12.1     |\n| 2.0\u002F2.0\u002F1.9                | 12.0     |\n\n\n## CI 流水线概览\n\n有关 CI 流水线的详细概述，请参阅 [ci-overview.md](ci-overview.md)。\n\n## 相关项目\n\n以下项目与 CCCL 使命（让 CUDA 更加愉悦）相关：\n- [cuCollections](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FcuCollections) - GPU 加速的数据结构，如哈希表\n- [NVBench](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fnvbench) - 专为 CUDA 应用定制的基准测试库\n- [stdexec](https:\u002F\u002Fgithub.com\u002Fnvidia\u002Fstdexec) - Senders 异步编程模型的参考实现\n\n## 使用 CCCL 的项目\n\n您的项目是否使用了 CCCL？[欢迎提交 PR 将您的项目添加到此列表中！](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fedit\u002Fmain\u002FREADME.md)\n\n- [AmgX](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FAMGX) - 多重网格线性求解器库\n- [ColossalAI](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI) - 用于编写分布式深度学习模型的工具\n- [cuDF](https:\u002F\u002Fgithub.com\u002Frapidsai\u002Fcudf) - 用于 ETL 数据分析的算法和文件读取器\n- [cuGraph](https:\u002F\u002Fgithub.com\u002Frapidsai\u002Fcugraph) - 图分析算法\n- [cuML](https:\u002F\u002Fgithub.com\u002Frapidsai\u002Fcuml) - 机器学习算法和基础组件（primitives）\n- [cuOpt](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcuopt) - 加速决策优化\n- [CuPy](https:\u002F\u002Fcupy.dev) - GPU 上的 NumPy 与 SciPy\n- [cuSOLVER](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcusolver) - 稠密与稀疏线性求解器\n- [CUSP](https:\u002F\u002Fgithub.com\u002Fcusplibrary\u002Fcusplibrary) - 稀疏矩阵运算、迭代方法和代数多重网格\n- [cuVS](https:\u002F\u002Fgithub.com\u002Frapidsai\u002Fcuvs) - 近似聚类与向量搜索\n- [GooFit](https:\u002F\u002Fgithub.com\u002FGooFit\u002FGooFit) - 最大似然拟合库\n- [HeavyDB](https:\u002F\u002Fgithub.com\u002Fheavyai\u002Fheavydb) - SQL 
数据库引擎\n- [HOOMD](https:\u002F\u002Fgithub.com\u002Fglotzerlab\u002Fhoomd-blue) - 蒙特卡洛与分子动力学模拟\n- [HugeCTR](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR) - GPU 加速的推荐系统框架\n- [Hydra](https:\u002F\u002Fgithub.com\u002FMultithreadCorner\u002FHydra) - 高能物理数据分析\n- [Hypre](https:\u002F\u002Fgithub.com\u002Fhypre-space\u002Fhypre) - 多重网格线性求解器\n- [LightSeq](https:\u002F\u002Fgithub.com\u002Fbytedance\u002Flightseq) - 序列处理与生成的训练和推理\n- [MatX](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fmatx) - 使用表达式模板（expression templates）提供高效、类似 Python 语法的数值计算库\n- [PyTorch](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch) - 张量与神经网络计算\n- [Qiskit](https:\u002F\u002Fgithub.com\u002FQiskit\u002Fqiskit-aer) - 量子电路高性能模拟器\n- [QUDA](https:\u002F\u002Fgithub.com\u002Flattice\u002Fquda) - 格点量子色动力学（QCD）计算\n- [RAFT](https:\u002F\u002Fgithub.com\u002Frapidsai\u002Fraft) - 机器学习算法和基础组件（primitives）\n- [SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang) - 大语言模型（LLM）服务框架\n- [TensorFlow](https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftensorflow) - 端到端机器学习平台\n- [TensorRT](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT) - 深度学习推理\n- [TensorRT-LLM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-LLM) - 优化的大语言模型（LLM）推理\n- [tsne-cuda](https:\u002F\u002Fgithub.com\u002FCannyLab\u002Ftsne-cuda) - 随机邻域嵌入（Stochastic Neighborhood Embedding）库\n- [可视化工具包（VTK, Visualization Toolkit）](https:\u002F\u002Fgitlab.kitware.com\u002Fvtk\u002Fvtk) - 渲染与可视化库\n- [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) - 大语言模型（LLM）推理与服务\n- [XGBoost](https:\u002F\u002Fgithub.com\u002Fdmlc\u002Fxgboost) - 梯度提升机器学习算法","# CCCL 快速上手指南\n\nCUDA Core Compute Libraries（CCCL）是 NVIDIA 提供的统一 CUDA C++ 基础库，整合了 Thrust、CUB 和 libcudacxx 三大核心组件，帮助开发者高效编写高性能 CUDA 代码。\n\n---\n\n## 环境准备\n\n- **操作系统**：Linux（推荐 Ubuntu 20.04+）或 Windows（与 CUDA Toolkit 的系统要求一致；CUDA 11 起已不再支持 macOS）\n- **GPU**：支持 CUDA 的 NVIDIA GPU（CUDA Toolkit 12.x 支持计算能力 ≥ 5.0，13.x 要求 ≥ 7.5）\n- **依赖项**：\n  - [CUDA Toolkit](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-toolkit) 11.8 
或更高版本（推荐 12.x）\n  - 支持 C++17 的主机编译器（如 GCC ≥ 9、Clang ≥ 12）\n  - （可选）CMake ≥ 3.20（用于高级集成）\n\n> 💡 国内用户建议通过 [清华镜像站](https:\u002F\u002Fmirrors.tuna.tsinghua.edu.cn\u002Fhelp\u002Fcuda\u002F) 安装 CUDA Toolkit 加速下载。\n\n---\n\n## 安装步骤\n\n### 方法一：使用 CUDA Toolkit（推荐）\n\nCCCL 已内置在 CUDA Toolkit 中，无需额外安装。只需确保已正确安装 CUDA：\n\n```bash\nnvcc --version\n```\n\n编译时直接包含头文件即可（`nvcc` 自动配置路径）。\n\n---\n\n### 方法二：从 GitHub 获取最新版\n\n适用于需要最新特性的开发者：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl.git\n```\n\n编译示例程序：\n\n```bash\nnvcc -Icccl\u002Fthrust -Icccl\u002Flibcudacxx\u002Finclude -Icccl\u002Fcub main.cu -o main\n```\n\n> ⚠️ 注意：使用 `-I` 而非 `-isystem`，避免与 CUDA Toolkit 内置头文件冲突。\n\n---\n\n### 方法三：通过 Conda 安装（推荐国内用户）\n\n```bash\nconda config --add channels conda-forge\nconda install cccl\n```\n\n若需匹配特定 CUDA 版本（如 12.4）：\n\n```bash\nconda install cuda-cccl cuda-version=12.4\n```\n\n---\n\n### 方法四：CMake 安装（适用于项目集成）\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl.git\ncd cccl\ncmake . 
-DCMAKE_INSTALL_PREFIX=\u002Fusr\u002Flocal\nmake install\n```\n\n或使用预设：\n\n```bash\ncmake --preset install -DCMAKE_INSTALL_PREFIX=\u002Fusr\u002Flocal\ncmake --build --preset install --target install\n```\n\n---\n\n## 基本使用\n\n以下是一个结合 Thrust、CUB 和 libcudacxx 实现并行归约的完整示例：\n\n```cpp\n#include \u003Cthrust\u002Fexecution_policy.h>\n#include \u003Cthrust\u002Fdevice_vector.h>\n#include \u003Ccub\u002Fblock\u002Fblock_reduce.cuh>\n#include \u003Ccuda\u002Fatomic>\n#include \u003Ccuda\u002Fcmath>\n#include \u003Ccuda\u002Fstd\u002Fspan>\n#include \u003Ccassert>\n#include \u003Ccstdio>\n\ntemplate \u003Cint block_size>\n__global__ void reduce(cuda::std::span\u003Cint const> data, cuda::std::span\u003Cint> result) {\n  using BlockReduce = cub::BlockReduce\u003Cint, block_size>;\n  __shared__ typename BlockReduce::TempStorage temp_storage;\n\n  int const index = threadIdx.x + blockIdx.x * blockDim.x;\n  int sum = 0;\n  if (index \u003C data.size()) {\n    sum += data[index];\n  }\n  sum = BlockReduce(temp_storage).Sum(sum);\n\n  if (threadIdx.x == 0) {\n    cuda::atomic_ref\u003Cint, cuda::thread_scope_device> atomic_result(result.front());\n    atomic_result.fetch_add(sum, cuda::memory_order_relaxed);\n  }\n}\n\nint main() {\n  int const N = 1000;\n  thrust::device_vector\u003Cint> data(N);\n  thrust::fill(data.begin(), data.end(), 1);\n\n  thrust::device_vector\u003Cint> kernel_result(1);\n  constexpr int block_size = 256;\n  int const num_blocks = cuda::ceil_div(N, block_size);\n  reduce\u003Cblock_size>\u003C\u003C\u003Cnum_blocks, block_size>>>(\n      cuda::std::span\u003Cint const>(thrust::raw_pointer_cast(data.data()), data.size()),\n      cuda::std::span\u003Cint>(thrust::raw_pointer_cast(kernel_result.data()), 1)\n  );\n\n  cudaDeviceSynchronize();\n  int const custom_result = kernel_result[0];\n  int const thrust_result = thrust::reduce(thrust::device, data.begin(), data.end(), 0);\n\n  std::printf(\"Custom kernel sum: %d\\n\", custom_result);\n  std::printf(\"Thrust reduce sum: %d\\n\", 
thrust_result);\n  assert(custom_result == thrust_result);\n  return 0;\n}\n```\n\n编译并运行：\n\n```bash\nnvcc -Icccl\u002Fthrust -Icccl\u002Flibcudacxx\u002Finclude -Icccl\u002Fcub example.cu -o example\n.\u002Fexample\n```\n\n> ✅ 输出应为：\n> ```\n> Custom kernel sum: 1000\n> Thrust reduce sum: 1000\n> ```","某医疗影像AI团队正在开发一个基于GPU加速的CT图像三维重建模块，需要在CUDA C++中高效实现大规模体素数据的并行归约与排序操作。\n\n### 没有 cccl 时\n- 团队需分别从三个独立仓库（Thrust、CUB、libcudacxx）拉取依赖，版本兼容性问题频发，经常因ABI不一致导致编译失败。\n- 在自定义核函数中实现块级归约时，只能手动编写低效的共享内存同步逻辑，难以达到硬件极限性能。\n- 调用标准并行算法（如reduce、sort）时，需额外配置后端策略，代码冗长且难以维护。\n- 设备端无法直接使用标准库功能（如原子操作、数学函数），需混用CUDA运行时API，降低代码可读性。\n- 调试和构建环境配置复杂，新成员上手周期长达一周。\n\n### 使用 cccl 后\n- 所有核心组件统一集成在一个仓库中，通过单一依赖即可获得完整、版本对齐的CUDA C++基础库，构建稳定性显著提升。\n- 直接调用 `cub::BlockReduce` 等原语，在自定义核函数中轻松实现接近理论峰值的块级归约，性能提升30%以上。\n- 利用 Thrust 的高层并行算法接口，一行代码完成设备向量归约，代码简洁且自动适配GPU架构。\n- 通过 libcudacxx 提供的 `cuda::atomic_ref` 和 `cuda::std::span`，在设备代码中安全使用现代C++标准库风格的抽象，提升可维护性。\n- 借助官方提供的 Dev Container 和 GitHub Codespaces 支持，新开发者5分钟内即可进入可编译调试环境。\n\ncccl 将CUDA C++开发所需的底层原语、并行算法与标准库能力无缝整合，让开发者专注核心算法而非基础设施。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVIDIA_cccl_f1310bf7.png","NVIDIA","NVIDIA Corporation","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FNVIDIA_7dcf6000.png","",null,"https:\u002F\u002Fnvidia.com","https:\u002F\u002Fgithub.com\u002FNVIDIA",[83,87,91,95,99,103,107,111],{"name":84,"color":85,"percentage":86},"C++","#f34b7d",62.7,{"name":88,"color":89,"percentage":90},"Cuda","#3A4E3A",30.9,{"name":92,"color":93,"percentage":94},"Python","#3572A5",3.3,{"name":96,"color":97,"percentage":98},"C","#555555",1.3,{"name":100,"color":101,"percentage":102},"CMake","#DA3434",1.1,{"name":104,"color":105,"percentage":106},"Shell","#89e051",0.5,{"name":108,"color":109,"percentage":110},"Cython","#fedf5b",0.2,{"name":112,"color":113,"percentage":114},"PowerShell","#012456",0.1,2250,373,"2026-04-05T02:19:37","NOASSERTION","Linux, Windows, macOS","需要 NVIDIA GPU，支持 CUDA 的 GPU 架构，CUDA Toolkit 11.7 
或更高版本（根据平台支持说明，兼容当前及前一个主版本系列的 CUDA Toolkit）","未说明",{"notes":123,"python":121,"dependencies":124},"CCCL 是纯头文件库，无需编译即可使用；若从 GitHub 使用最新版，需注意不能与比其更新的 CUDA Toolkit 搭配（不支持向前兼容）；推荐通过 CUDA Toolkit、GitHub 源码、Conda 或 CMake 方式集成；使用非 nvcc 编译器时需手动指定头文件路径；建议不要使用 -isystem 而应使用 -I 包含头文件以避免冲突。",[125,100,126,127],"CUDA Toolkit >=11.7","nvcc","gcc 或其他支持 CUDA 的主机编译器",[13],[130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148],"accelerated-computing","cpp","cpp-programming","cuda","cuda-cpp","cuda-kernels","cuda-library","cuda-programming","gpu","gpu-acceleration","gpu-computing","gpu-programming","hpc","nvidia","nvidia-gpu","parallel-algorithm","parallel-computing","parallel-programming","modern-cpp","2026-03-27T02:49:30.150509","2026-04-06T05:15:38.027741",[152,157,162,167,172,176],{"id":153,"question_zh":154,"answer_zh":155,"source_url":156},1586,"thrust::tuple 是否支持变参模板？为什么当前实现有限制？","目前 thrust::tuple 未使用 C++11 变参模板，因此仅支持最多 10 个元素的元组，编译速度慢且错误信息难以阅读。项目计划通过引入 libcu++ 的变参 tuple 实现来改进。用户可通过 THRUST_USE_CXX11 宏或自动检测 __cplusplus >= 201103L 来启用 C++11 实现（该改进已在 PR #262 中合并）。","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fissues\u002F695",{"id":158,"question_zh":159,"answer_zh":160,"source_url":161},1587,"any_resource 应该支持拷贝构造和拷贝赋值吗？","官方决定 any_resource 不应支持拷贝操作，因为内存资源可能持有大量已分配内存（如池式分配器），拷贝语义不明确且可能导致状态不一致。any_resource 实际上是 resource_ref 的扩展，大小相同，但拷贝行为已被明确排除，建议在需要延长资源生命周期时使用 shared_resource 而非 any_resource。","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fissues\u002F2379",{"id":163,"question_zh":164,"answer_zh":165,"source_url":166},1588,"CCCL C 库为何要将 NVRTC 和 nvJitLink 改为运行时加载（dlopen）而非动态链接？","为了支持 Python 打包、兼容多个 CUDA 主版本（如 cu11\u002Fcu12）并避免导入时 ImportError，NVRTC 和 nvJitLink 应在运行时通过 dlopen 加载。虽然仍需保留 NEEDED 条目以确保库的合法性，但运行时加载可灵活适配不同安装路径和版本。相关实现在 PR #4735 中提供。","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fissues\u002F3979",{"id":168,"question_zh":169,"answer_zh":170,"source_url":171},1589,"在 RTX PRO 6000 Blackwell 上调用 thrust::transform 后为何 GPU 
内存占用异常高达 4GB？","该问题与使用 compute_120 架构编译时 TMA（Tensor Memory Accelerator）的隐式启用有关。将编译选项从 -gencode arch=compute_120,code=sm_120 改为 -gencode arch=compute_80,code=sm_120 可规避此问题。该 bug 在 CCCL 3.2.0 版本中已修复，建议升级至该版本或更高。","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fissues\u002F6708",{"id":173,"question_zh":174,"answer_zh":175,"source_url":161},1590,"如何正确传递 memory_resource 以确保分配的内存生命周期安全？","若函数返回的对象使用了传入的 memory_resource 进行分配，且其生命周期超过函数调用，则应接受 any_resource 或 shared_resource（而非 resource_ref）。尽管 any_resource 表面上支持拷贝，但实际设计中已弃用拷贝语义；shared_resource 更适合共享所有权场景。any_resource 与 resource_ref 大小相同，性能开销相近。",{"id":177,"question_zh":178,"answer_zh":179,"source_url":156},1591,"CCCL 是否计划支持 C++11 变参模板以改进 thrust::tuple？","是的，项目正逐步引入 libcu++ 依赖，以便使用其基于变参模板的 tuple 实现。这将解决当前 thrust::tuple 元素数量限制（最多10个）、编译速度慢和错误信息不友好等问题。用户未来可通过定义 THRUST_USE_CXX11 宏或依赖自动 C++11 检测来启用新实现。",[181,186,191,196,201,206,211,216,221,226,231,236,241,246,251,256,261,266,271,276],{"id":182,"version":183,"summary_zh":184,"released_at":185},101095,"v3.3.0","\u003C!-- Release notes generated using configuration in .github\u002Frelease.yml at v3.3.0 -->\r\n\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fcompare\u002Fv3.3.0...v3.3.0\r\n\r\n\u003C!-- Release notes generated using configuration in .github\u002Frelease.yml at v3.3.0 -->\r\n\r\n## What's Changed\r\n### 📚 Libcudacxx\r\n* [libcudacxx] Fix a typo in the documentation by @caugonnet in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7330\r\n* Add a test for \u003Cnv\u002Ftarget> to validate old dialect support. 
by @wmaxey in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7241\r\n### 🔄 Other Changes\r\n* Implement `cudax::cufile` by @davebayer in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6122\r\n* Update linear_congruential_generator with constexpr, tests and a fast discard by @RAMitchell in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6402\r\n* Replace `_CCCL_HAS_CUDA_COMPILER()` with `_CCCL_CUDA_COMPILATION()` by @davebayer in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6399\r\n* Remove unnecessary casts in complex multiplication\u002Fdivision by @davebayer in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6670\r\n* Add benchmark batch script by @bernhardmgruber in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6661\r\n* Improvements and testing for inspect_changes CI functionality. by @alliepiper in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6535\r\n* Improve clarity of CCCL assert macro documentation by @jrhemstad in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6675\r\n* Fix oversubscription issue with lit precompile, label hack by @alliepiper in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6554\r\n* Make missing sccache nonfatal. by @alliepiper in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6582\r\n* Address pending comments for `make_tma_descriptor` by @fbusato in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6662\r\n* Add nvhpc 25.9. by @alliepiper in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6003\r\n* Test building for all arches. by @alliepiper in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6113\r\n* Add nvbench_helper tests to CI. by @alliepiper in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6679\r\n* Add more targets to pytorch build. 
by @alliepiper in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6685\r\n* Add host std lib version detection by @davebayer in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6678\r\n* Improve CUB benchmark docs by @bernhardmgruber in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6640\r\n* Use `if consteval` in libcu++ by @davebayer in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6424\r\n* Update docs for `_CCCL_IF_CONSTEVAL` by @davebayer in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6692\r\n* Fixes issue with select close to int_max by @elstehle in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6641\r\n* Update libcudacxx C++ dialect handling. by @alliepiper in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6693\r\n* Simplifies env usage in `DeviceTopK` tests by @elstehle in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6680\r\n* Switch to S3 preprocessor cache by @alliepiper in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6561\r\n* fix omp scan bug by @charan-003 in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6560\r\n* Refactor out variant from transform tunings by @bernhardmgruber in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6669\r\n* [libcu++] Waive hierarchy constexpr testing on GCC8 by @pciolkosz in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6707\r\n* Use wrapper with `void*` argument types for iterator advance\u002Fdereference signature by @shwina in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6634\r\n* Restore libcudacxx dialect presets. 
by @alliepiper in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6705\r\n* Refactor error handling in radix sort dispatch by @bernhardmgruber in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6681\r\n* Remove special dialect handling from cudax build system. by @alliepiper in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6702\r\n* Segmented scan followup by @oleksandr-pavlyk in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6706\r\n* Fix electing leader from any group in `cuda::memcpy_async` by @bernhardmgruber in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6710\r\n* Avoid scaling twice in `ReduceNondeterministicPolicy` by @bernhardmgruber in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6711\r\n* Remove special handling of C++ dialect in CUB's build system by @alliepiper in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6713\r\n* [libcu++] Use resource test fixture members through this by @pciolkosz in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6717\r\n* Improves top-k examples to illustrate stream usage by @elstehle in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6723\r\n* Tweak `sol.py` a bit by @bernhardmgruber in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6721\r\n* Implement PCG64 as extension by @RAMitchell in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6292\r\n* Use PDL in cub::DeviceScan by @bernhardmgruber in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6639\r\n* Fix header in libcudacxx test by @alliepiper in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6726\r\n* Remove dead code. by @alliepiper in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6725\r\n* Add deps on thrust\u002Fcub to libcudacxx. 
by @alliepiper in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6694\r\n* Remove special handling for dialect in Thrust's build system. by @alliepiper in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6722\r\n* [libcu++] Automatically bump up the release thres","2026-02-27T22:39:02",{"id":187,"version":188,"summary_zh":189,"released_at":190},101096,"v3.2.1","\u003C!-- Release notes generated using configuration in .github\u002Frelease.yml at v3.2.1 -->\r\n\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fcompare\u002Fv3.2.1...v3.2.1\r\n\r\n\u003C!-- Release notes generated using configuration in .github\u002Frelease.yml at v3.2.1 -->\r\n\r\n## What's Changed\r\n### 🔄 Other Changes\r\n* Bump branch\u002F3.2.x to 3.2.1. by @wmaxey in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7329\r\n* [Backport branch\u002F3.2.x] Add accessor methods to shared_resource\u003CT> by @github-actions[bot] in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7322\r\n* [Backport branch\u002F3.2.x] Fix clang warning about missing braces again by @github-actions[bot] in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7324\r\n* [Backport branch\u002F3.2.x] part deux: make the abi of `__basic_any` compatible between c++17 and c++20 by @github-actions[bot] in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7421\r\n* [backport 3.2] Fix missing c2h symbol when compiling with clang-cuda (#7454) by @davebayer in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7600\r\n* [Backport branch\u002F3.2.x] Remove recursion from __internal_is_address_from by @github-actions[bot] in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7573\r\n* [Backport branch\u002F3.2.x] Fix `ranges_overlap` for `nvc++ -cuda` by @github-actions[bot] in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7598\r\n* [Backport 
branch\u002F3.2.x] Fix `cuda::device::current_arch_id` by @github-actions[bot] in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7601\r\n* [Backport branch\u002F3.2.x] Check for `_GLIBCXX_USE_CXX11_ABI` only when compiling with libstdc++ by @github-actions[bot] in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7630\r\n* [Backport branch\u002F3.2.x] Fix cuda::barrier missing accounting of results in try_wait by @github-actions[bot] in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7634\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fcompare\u002Fv3.2.0...v3.2.1","2026-02-12T01:03:32",{"id":192,"version":193,"summary_zh":194,"released_at":195},101097,"python-0.5.1","These are the release notes for the `cuda-cccl` Python package version 0.5.1, dated **February 6th, 2026**. The previous release was v0.5.0.\r\n\r\n`cuda-cccl` is in \"experimental\" status, meaning that its API and feature set can change quite rapidly.\r\n\r\n## Installation\r\n\r\nPlease refer to the install instructions [here](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fpython\u002Fsetup.html)\r\n\r\n## Features\r\n\r\n## Improvements\r\n\r\n - Restrict to numba-cuda less than 0.27 (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7529)\r\n \r\n## Bug Fixes\r\n\r\n -  Fix caching of functions referencing numpy ufuncs (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7535)\r\n","2026-02-07T10:24:15",{"id":197,"version":198,"summary_zh":199,"released_at":200},101098,"python-0.5.0","These are the release notes for the `cuda-cccl` Python package version 0.5.0, dated **February 5th, 2026**. 
The previous release was v0.4.5.\r\n\r\n`cuda-cccl` is in \"experimental\" status, meaning that its API and feature set can change quite rapidly.\r\n\r\n## Installation\r\n\r\nPlease refer to the install instructions [here](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fpython\u002Fsetup.html)\r\n\r\n## ⚠️ Breaking change\r\n\r\n### Object-based API requires passing operator to algorithm `__call__` method\r\n\r\nThis API change affects only users of the [object-based API (expert mode)](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fpython\u002Fcompute.html#object-based-api-expert-mode).\r\n\r\nPreviously, constructing an algorithm object required passing the operator as an argument, but _invoking_ it did not:\r\n\r\n```python\r\n# step 1: create algorithm object\r\ntransformer = cuda.compute.make_unary_transform(d_input, d_output, some_unary_op)\r\n\r\n# step 2: invoke algorithm\r\ntransformer(d_in1, d_out1, num_items1)  # NOTE: not passing some_unary_op here\r\n```\r\n\r\nThe new behaviour requires passing it in both places:\r\n\r\n```python\r\n# step 1: create algorithm object\r\ntransformer = cuda.compute.make_unary_transform(d_input, d_output, some_unary_op)\r\n\r\n# step 2: invoke algorithm\r\ntransformer(d_in1, d_out1, some_unary_op, num_items1)  # NOTE: need to pass some_unary_op here\r\n```\r\n\r\nThis change is introduced because in many situations (such as in a loop), the operator itself and the globals\u002Fclosures it references can change between construction and invocation (or between invocations).\r\n\r\n## Features\r\n\r\n\r\n## Improvements\r\n\r\n - Avoid unnecessary recompilation of stateful operators (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7500)\r\n - Improved cache lookup performance (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7501)\r\n\r\n## Bug Fixes\r\n\r\n -  Fix handling of boolean types in cuda.compute 
(https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7389)\r\n","2026-02-05T14:38:02",{"id":202,"version":203,"summary_zh":204,"released_at":205},101099,"v3.2.0","\u003C!-- Release notes generated using configuration in .github\u002Frelease.yml at v3.2.0 -->\r\n\r\nThe CCCL team is excited to announce the 3.2 release of the CUDA Core Compute Library (CCCL) whose highlights include new modern CUDA C++ runtime APIs and new speed-of-light algorithms including Top-K.\r\n\r\n## Modern CUDA C++ Runtime\r\nCCCL 3.2 broadly introduces new, [idiomatic C++ interfaces for core CUDA](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Flibcudacxx\u002Fruntime.html) runtime and driver functionality. \r\n\r\nIf you’ve written CUDA C++ for a while, you’ve likely built (or adopted) some form of convenience wrappers around today’s C-like APIs like cudaMalloc or cudaStreamCreate. \r\n\r\nThe new APIs added in CCCL 3.2 are meant to provide the productivity and safety benefits of C++ for core CUDA constructs so you can spend less time reinventing wrappers and more time writing kernels and algorithms.\r\n\r\n### Highlights:\r\n- New convenient vocabulary types for core CUDA concepts (`cuda::stream`, `cuda::event`, `cuda::arch_traits`)\r\n- Easier memory management with [Memory Resources](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Flibcudacxx\u002Fextended_api\u002Fmemory_resource.html#libcudacxx-extended-api-memory-resources) and `cuda::buffer`\r\n- More powerful and convenient kernel launch with `cuda::launch`\r\n\r\nExample (vector add, revisited):\r\n```cpp\r\ncuda::device_ref device = cuda::devices[0];\r\ncuda::stream stream{device};\r\nauto pool = cuda::device_default_memory_pool(device);\r\n\r\nint num_elements = 1000;\r\nauto A = cuda::make_buffer\u003Cfloat>(stream, pool, num_elements, 1.0);\r\nauto B = cuda::make_buffer\u003Cfloat>(stream, pool, num_elements, 2.0);\r\nauto C = cuda::make_buffer\u003Cfloat>(stream, pool, num_elements, 
cuda::no_init);\r\n\r\nconstexpr int threads_per_block = 256;\r\nauto config = cuda::distribute\u003Cthreads_per_block>(num_elements);\r\nauto kernel = [] __device__ (auto config, cuda::std::span\u003Cconst float> A, \r\n                                            cuda::std::span\u003Cconst float> B, \r\n                                            cuda::std::span\u003Cfloat> C){\r\n    auto tid = cuda::gpu_thread.rank(cuda::grid, config);\r\n    if (tid \u003C A.size())\r\n        C[tid] = A[tid] + B[tid];\r\n};\r\ncuda::launch(stream, config, kernel, config, A, B, C);\r\n```\r\n[(Try this example live on Compiler Explorer!)](https:\u002F\u002Fgodbolt.org\u002Fz\u002Fsbj6jvj3a)\r\n\r\nA forthcoming blog post will go deeper into the details, the design goals, intended usage patterns, and how these new APIs fit alongside existing CUDA APIs.\r\n\r\n## New Algorithms\r\n### Top-K Selection\r\nCCCL 3.2 introduces `cub::DeviceTopK` (for example, `cub::DeviceTopK::MaxKeys`) to select the K largest (or smallest) elements without sorting the entire input. For workloads where K is small, this can deliver up to 5X speedups over a full radix sort, and can reduce memory consumption when you don’t need sorted results. \r\n\r\nTop‑K is an active area of ongoing work for CCCL: our roadmap includes planned segmented Top‑K as well as block‑scope and warp‑scope Top‑K variants. 
See what’s planned and tell us what Top‑K use cases matter most in [CCCL GitHub issue #5673](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fissues\u002F5673).\r\n\r\n\u003Cimg width=\"512\" height=\"268\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F75d2b0d0-58d8-49cc-9f74-e30e10a45b1a\" \u002F>\r\n\r\n### Fixed-size Segmented Reduction\r\nCCCL 3.2 now provides a new [cub::DeviceSegmentedReduce](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fcub\u002Fapi\u002Fstructcub_1_1DeviceSegmentedReduce.html#_CPPv4I0000EN3cub21DeviceSegmentedReduce6ReduceE11cudaError_tPvR6size_t14InputIteratorT15OutputIteratorTN4cudaSt7int64_tEi12ReductionOpT1T12cudaStream_t) variant that accepts a uniform segment_size, eliminating offset iterator overhead in the common case when segments are fixed-size. This enables optimizations for both small segment sizes (up to 66x) and large segment sizes (up to 14x). \r\n\r\n```cpp\r\n\u002F\u002F New API accepts fixed segment_size instead of per-segment begin\u002Fend offsets\r\ncub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, input, output,  \r\n                                num_segments, segment_size); \r\n```\r\n\r\n\r\n\u003Cimg width=\"512\" height=\"261\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fd67db726-741e-4541-9fed-c2a5277281df\" \u002F>\r\n\r\n### Additional New Algorithms in CCCL 3.2\r\n[Segmented Scan](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fcub\u002Fapi\u002Fstructcub_1_1DeviceSegmentedScan.html) - cub::DeviceSegmentedScan provides a segmented version of a parallel scan that efficiently computes a scan operation over multiple independent segments. \r\n\r\n[Binary Search](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fcub\u002Fapi\u002Fstructcub_1_1DeviceFind.html#cub-devicefind) - cub::DeviceFind::[Upper\u002FLowerBound] performs a parallel search for multiple values in an ordered sequence. 
\r\n\r\n[Search](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fcub\u002Fapi\u002Fstructcub_1_1DeviceFind.html#cub-devicefind) - cub::DeviceFind::FindIf searches the unordered input for the first element that satisfies a given condition. Thanks to its early-exit logic, it can be up to 7x faster than searching the entire sequence.\r\n\r\n\r\n**Full Changelog**: https:\u002F","2026-02-05T21:55:24",{"id":207,"version":208,"summary_zh":209,"released_at":210},101100,"python-0.4.5","These are the release notes for the `cuda-cccl` Python package version 0.4.5, dated **January 23rd, 2026**. The previous release was v0.4.4.\r\n\r\n`cuda-cccl` is in \"experimental\" status, meaning that its API and feature set can change quite rapidly.\r\n\r\n## Installation\r\n\r\nPlease refer to the install instructions [here](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fpython\u002Fsetup.html)\r\n\r\n## Features\r\n\r\n  - Add cuda.compute APIs for upper_bound and lower_bound (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7250)\r\n  - Support lambdas as operators in cuda.compute (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7058)\r\n\r\n## Improvements\r\n\r\n  - Consolidate caching logic across cuda.compute algorithms (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7281)\r\n  - Allow multiple uses of the same function in one compilation (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7072)\r\n  - Make cuda.compute importable in CPU-only environments (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7171)\r\n  - Improve cuda.compute documentation (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7061)\r\n  - Update Python package versioning flow (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fcommit\u002F96f98db7ad672c27cc3a52a3bcb78d72c92a4dd2)\r\n\r\n## Bug Fixes\r\n\r\n  - Fix deferred annotations handling 
(https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7321, https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7121)\r\n  - Disable LDL\u002FSTL checks to avoid NVRTC 13.1 failures (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7054)\r\n  - Fix documentation build issues (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7122)\r\n  - Fix Python-related docs (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F7052)","2026-01-23T16:12:14",{"id":212,"version":213,"summary_zh":214,"released_at":215},101101,"python-0.4.3","These are the release notes for the `cuda-cccl` Python package version 0.4.3, dated **December 18th, 2025**. The previous release was v0.4.2.\r\n\r\n`cuda.cccl` is in \"experimental\" status, meaning that its API and feature set can change quite rapidly.\r\n\r\n## Installation\r\n\r\nPlease refer to the install instructions [here](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fpython\u002Fsetup.html)\r\n\r\n## Features\r\n\r\n-\r\n\r\n## Improvements and bug fixes\r\n- Add missing OpKind docs entries (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6910)\r\n- Unify operator handling in cuda.compute (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6938)\r\n- [cuda.compute] Refactor code for creating void* wrappers (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6941)\r\n- Remove need for hardcoded `LevelT` for histogram in c.parallel and cuda.compute (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6915)\r\n- c.parallel: reuse CUB agent policies for histogram (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6974)\r\n- [cuda.compute]: fix alignment not being set properly for `gpu_struct` types (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6995)","2025-12-18T14:54:50",{"id":217,"version":218,"summary_zh":219,"released_at":220},101102,"v3.1.4","\u003C!-- Release 
notes generated using configuration in .github\u002Frelease.yml at v3.1.4 -->\r\n\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fcompare\u002Fv3.1.4...v3.1.4\r\n\r\n\u003C!-- Release notes generated using configuration in .github\u002Frelease.yml at v3.1.4 -->\r\n\r\n## What's Changed\r\n### 🔄 Other Changes\r\n* [Backport 3.1] Add sm_62 arch traits (#6772) by @davebayer in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6779\r\n* [Backport branch\u002F3.1.x] Fix arch related `cuda::device::` APIs for nvhpc in CUDA mode by @github-actions[bot] in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6831\r\n* [Backport to 3.1] Ensure test kernels remain active during allocator testing. (#5899) by @bernhardmgruber in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6849\r\n* Bump branch\u002F3.1.x to 3.1.4. by @wmaxey in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6764\r\n* [Backport branch\u002F3.1.x] [libcu++] Fix minor version compatibility in 13.X by @github-actions[bot] in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6917\r\n* [Backport 3.1] Properly specialize cub functions for `__nv_bfloat16` (#6931) by @miscco in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6945\r\n* [Branch 3.1] Backport current nvtarget changes by @miscco in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6977\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fcompare\u002Fv3.1.3...v3.1.4","2026-01-21T22:26:16",{"id":222,"version":223,"summary_zh":224,"released_at":225},101103,"python-0.4.2","These are the release notes for the `cuda-cccl` Python package version 0.4.2, dated **December 9th, 2025**. 
The previous release was v0.4.1.\r\n\r\n`cuda.cccl` is in \"experimental\" status, meaning that its API and feature set can change quite rapidly.\r\n\r\n## Installation\r\n\r\nPlease refer to the install instructions [here](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fpython\u002Fsetup.html)\r\n\r\n## Features\r\n\r\n-\r\n\r\n## Improvements and bug fixes\r\n\r\n- Add explicit dependency on nvidia-nvvm (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6909 )","2025-12-09T16:58:54",{"id":227,"version":228,"summary_zh":229,"released_at":230},101104,"python-0.4.1","These are the release notes for the `cuda-cccl` Python package version 0.4.1, dated **December 8th, 2025**. The previous release was v0.4.0.\r\n\r\n`cuda.cccl` is in \"experimental\" status, meaning that its API and feature set can change quite rapidly.\r\n\r\n## Installation\r\n\r\nPlease refer to the install instructions [here](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fpython\u002Fsetup.html)\r\n\r\n## Features\r\n\r\n-\r\n\r\n## Improvements and bug fixes\r\n\r\n- Fix issue with `get_dtype()` not working anymore for pytorch arrays (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6882)\r\n- Add fast path to extract PyTorch array pointer (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6884)\r\n\r\n## Breaking Changes\r\n\r\n-","2025-12-08T10:06:45",{"id":232,"version":233,"summary_zh":234,"released_at":235},101105,"python-0.4.0","These are the release notes for the `cuda-cccl` Python package version 0.4.0, dated **December 3rd, 2025**. 
The previous release was v0.3.4.\r\n\r\n`cuda.cccl` is in \"experimental\" status, meaning that its API and feature set can change quite rapidly.\r\n\r\n## Installation\r\n\r\nPlease refer to the install instructions [here](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fpython\u002Fsetup.html)\r\n\r\n## Features\r\n\r\n* Added `select` algorithm for filtering data (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6766)\r\n* Support for nested structs (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6353)\r\n* Added `DiscardIterator` (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6618)\r\n* The `cccl-python` Python package can now be installed via conda (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6513)\r\n\r\n## Improvements and bug fixes\r\n\r\n- Allow numpy struct types as initial value for Zipiterator inputs ([#6861](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6861))\r\n- Allow using ZipIterator as an output in cuda.compute ([#6518](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6518))\r\n- Enable caching of advance\u002Fdereference methods for Zipiterator and PermutationIterator ([#6753](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6753))\r\n- Use wrapper with `void*` argument types for iterator advance\u002Fdereference signature ([#6634](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6634))\r\n- Fixes and improvements to function caching ([#6758](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6758))\r\n- Fix handling of wrapped cuda.jit functions ([#6770](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6770))\r\n- Use annotations if available to determine return type of transform op ([#6760](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6760))\r\n- Allow passing in `None` as init value for scan when using an iterator as input 
([#6499](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6499))\r\n\r\n## Breaking Changes\r\n\r\n-","2025-12-03T21:51:42",{"id":237,"version":238,"summary_zh":239,"released_at":240},101106,"v3.1.3","\u003C!-- Release notes generated using configuration in .github\u002Frelease.yml at v3.1.3 -->\r\n\r\n## What's Changed\r\n### 🔄 Other Changes\r\n* [Backport branch\u002F3.1.x] Fix invalid reference type of `cuda::strided_iterator` by @github-actions[bot] in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6517\r\n* [Backport branch\u002F3.1.x] Fixes issue with select close to int_max by @github-actions[bot] in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6700\r\n* Bump branch\u002F3.1.x to 3.1.3. by @wmaxey in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6621\r\n* Backport changes for XGBoost compatibility by @bdice in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6727\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fcompare\u002Fv3.1.2...v3.1.3","2025-11-24T18:09:01",{"id":242,"version":243,"summary_zh":244,"released_at":245},101107,"v3.1.2","\u003C!-- Release notes generated using configuration in .github\u002Frelease.yml at v3.1.2 -->\r\n\r\n## What's Changed\r\n### 🔄 Other Changes\r\n* [BACKPORT 3.1] Always include `\u003Cnew>` when we need operator new for clang-cuda (#6310) by @miscco in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6445\r\n* [Backport branch\u002F3.1.x] Fix offset_iterator tests by @github-actions[bot] in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6446\r\n* [BACKPORT 3.1] Add `_CCCL_DECLSPEC_EMPTY_BASES` to mdspan features (#6444) by @miscco in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6449\r\n* Bump branch\u002F3.1.x to 3.1.2. 
by @wmaxey in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6433\r\n* [Backport 3.1] Fix clang 21 issues (#6404) by @davebayer in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6447\r\n* [Backport branch\u002F3.1.x] Ensure that `detect_wrong_difference` is a valid output iterator by @github-actions[bot] in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6453\r\n* [Backport to 3.1] Fix `cub.bench.radix_sort.keys.base` regression on H200 (#6452) by @bernhardmgruber in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6458\r\n* [Backport 3.1] Do not mark deduction guides as hidden (#6350) by @miscco in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6457\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fcompare\u002Fv3.1.1...v3.1.2","2025-11-13T19:14:08",{"id":247,"version":248,"summary_zh":249,"released_at":250},101108,"python-0.3.4","These are the release notes for the `cuda-cccl` Python package version 0.3.4, dated **November 5th, 2025**. The previous release was v0.3.3.\r\n\r\n`cuda.cccl` is in \"experimental\" status, meaning that its API and feature set can change quite rapidly.\r\n\r\n## Installation\r\n\r\nPlease refer to the install instructions [here](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fpython\u002Fsetup.html)\r\n\r\n## Features and improvements\r\n\r\n* Introduced [`cuda.compute.segmented_sort`](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fpython\u002Fcompute_api.html#cuda.compute.algorithms.segmented_sort) API.\r\n\r\n## Bug Fixes\r\n\r\n-\r\n\r\n## Breaking Changes\r\n\r\n-","2025-11-05T18:46:27",{"id":252,"version":253,"summary_zh":254,"released_at":255},101109,"v3.1.1","\u003C!-- Release notes generated using configuration in .github\u002Frelease.yml at v3.1.1 -->\r\n\r\n## What's Changed\r\n### 🔄 Other Changes\r\n* Bump branch\u002F3.1.x to 3.1.1. 
by @wmaxey in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6235\r\n* [Backport branch\u002F3.1.x] Fix `__compressed_movable_box` by @github-actions[bot] in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6248\r\n* [Backport branch\u002F3.1.x] Fix `__is_primary_std_template` for libc++ by @github-actions[bot] in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6249\r\n* [Backport 3.1] Fix invalid refactoring of  #4377 (#6246) by @miscco in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6265\r\n* [Backport branch\u002F3.1.x] Fix using `char` as the index type of `tabulate_output_iterator` by @github-actions[bot] in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6273\r\n* [Backport 3.1]: Fix missing qualifications for `__construct_at` (#6270) by @miscco in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6274\r\n* [Backport branch\u002F3.1.x] Fix missed constructor with compressed box by @github-actions[bot] in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6272\r\n* [Backport 3.1] Fix `string_view` construction from `std::string_view` (#6291) by @davebayer in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6301\r\n* [Backport 3.1] Include `\u003Cmath.h>` in `\u003Ccuda\u002Fstd\u002Fcmath>` headers unconditionally (#6333) by @davebayer in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6339\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fcompare\u002Fv3.1.0...v3.1.1","2025-11-13T19:13:41",{"id":257,"version":258,"summary_zh":259,"released_at":260},101110,"python-0.3.3","These are the release notes for the `cuda-cccl` Python package version 0.3.3, dated **October 21st, 2025**. 
The previous release was v0.3.2.\r\n\r\n`cuda.cccl` is in \"experimental\" status, meaning that its API and feature set can change quite rapidly.\r\n\r\n## Installation\r\n\r\nPlease refer to the install instructions [here](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fpython\u002Fsetup.html)\r\n\r\n## Features and improvements\r\n\r\n* This is the first release that features Windows wheels published to PyPI. You can now `pip install cuda-cccl[cu12]` or `pip install cuda-cccl[cu13]` on Windows for Python versions 3.10, 3.11, 3.12, and 3.13.\r\n\r\n## Bug Fixes\r\n\r\n-\r\n\r\n## Breaking Changes\r\n\r\n-","2025-10-21T21:44:03",{"id":262,"version":263,"summary_zh":264,"released_at":265},101111,"python-0.3.2","These are the release notes for the `cuda-cccl` Python package version 0.3.2, dated **October 17th, 2025**. The previous release was v0.3.1.\r\n\r\n`cuda.cccl` is in \"experimental\" status, meaning that its API and feature set can change quite rapidly.\r\n\r\n## Installation\r\n\r\nPlease refer to the install instructions [here](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fpython\u002Fsetup.html)\r\n\r\n## Features and improvements\r\n\r\n* Allow passing in a device array or `None` as the initial value in scan.\r\n\r\n## Bug Fixes\r\n\r\n-\r\n\r\n## Breaking Changes\r\n\r\n-","2025-10-20T23:17:44",{"id":267,"version":268,"summary_zh":269,"released_at":270},101112,"v3.1.0","\u003C!-- Release notes generated using configuration in .github\u002Frelease.yml at v3.1.0 -->\r\n\r\n## Highlights\r\n\r\n### New options for deterministic reductions in `cub::DeviceReduce`\r\n\r\nDue to non-associativity of floating point addition, `cub::DeviceReduce` historically only guaranteed bitwise identical results run-to-run on the same GPU.\r\n\r\nStarting with CCCL 3.1, CCCL formalizes three different levels of determinism with different performance trade-offs:\r\n- Not-guaranteed (new!) 
- new single-pass reduction using atomics\r\n- Run-to-run (status quo) - existing two-pass implementation\r\n- GPU-to-GPU (new!) - based on reproducible reduction in @maddyscientis's [GTC 2024 talk](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtc24-s62405\u002F)\r\n\r\n```\r\n\u002F\u002F Pick your desired trade-off of performance and determinism\r\n\u002F\u002F auto env = cuda::execution::require(cuda::execution::determinism::not_guaranteed);\r\n\u002F\u002F auto env = cuda::execution::require(cuda::execution::determinism::run_to_run);\r\n\u002F\u002F auto env = cuda::execution::require(cuda::execution::determinism::gpu_to_gpu);\r\ncub::DeviceReduce::Sum(..., env);\r\n```\r\n\u003Cimg width=\"800\" height=\"525\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fdfbc440d-4026-4be7-8ae0-b5823dbaa5d1\" \u002F>\r\n\u003Cimg width=\"800\" height=\"525\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fa74395f3-f617-4356-b046-b16ed85e2f11\" \u002F>\r\n\r\n  | Not-Guaranteed (new!) | Run-to-run (status quo) | GPU-to-GPU (new!)\r\n-- | -- | -- | --\r\nDeterminism | Varies per run | Varies per GPU | Constant\r\nPerformance | Best | Better | Good\r\n\r\n\r\n### More convenient single-phase CUB APIs\r\n\r\nNearly every CUB algorithm requires temporary storage as intermediate scratch space to carry out the algorithm.\r\n\r\nHistorically, it was the user's responsibility to query and allocate the necessary temporary storage through a two-phase call pattern that is cumbersome and error-prone if arguments aren't passed identically between the two invocations.\r\n\r\nCCCL 3.1 adds new overloads of some CUB algorithms that accept a memory resource so you can skip the temp-storage query\u002Fallocate\u002Ffree pattern.\r\n\r\nBefore\r\n```\r\n\u002F\u002F Determine temporary storage size\r\ncub::DeviceScan::ExclusiveSum(d_temp_storage, \r\n                              temp_storage_bytes, \r\n                              nullptr, ...);\r\n\r\n\u002F\u002F Allocate the required temporary storage\r\ncudaMallocAsync(&d_temp_storage,\r\n                temp_storage_bytes, stream);\r\n\r\n\u002F\u002F Run the actual scan\r\ncub::DeviceScan::ExclusiveSum(d_temp_storage,\r\n                              temp_storage_bytes, \r\n                              d_input...);\r\n\r\n\u002F\u002F Free the temporary storage\r\ncudaFreeAsync(d_temp_storage, stream);\r\n```\r\n\r\nAfter\r\n```\r\n\u002F\u002F Pool mr uses cudaMallocAsync under the hood\r\ncuda::device_memory_pool mr{cuda::devices[0]};\r\n\r\n\u002F\u002F Single call. 
Temp storage is handled by the pool.\r\ncub::DeviceScan::ExclusiveSum(d_input,..., mr);\r\n```\r\n\r\n## What's Changed\r\n### 🚀 Thrust \u002F CUB\r\n- [par_nosync](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F4204) now uses async allocations by default.\r\n- New [reduce_into](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fthrust\u002Fapi_docs\u002Falgorithms\u002Freductions.html) algorithm ([PR #4355](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F4355)).\r\n- Added [strided_iterator](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Flibcudacxx\u002Fapi\u002Fclassstrided__iterator.html) ([PR #4014](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F4014)).\r\n- thrust::device_vector now supports [default-init and skip-init constructors](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fthrust\u002Fapi\u002Fclassthrust_1_1device__vector.html) ([PR #4183](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F4183)).\r\n- New overloads for [cub::WarpReduce](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fcub\u002Fapi\u002Fclasscub_1_1WarpReduce.html) ([PR #3884](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F3884)).\r\n- Tuned [cub::ThreadReduce](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fcub\u002Fapi\u002Fnamespacecub_1a5bc54df3b7da4260d8c1d8579f63f0c0.html) ([PR #3441](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F3441)).\r\n\r\n### libcu++\r\n- Added host\u002Fdevice\u002Fmanaged [mdspan](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Flibcudacxx\u002Fstandard_api\u002Fcontainer_library\u002Fmdspan.html) and [accessors](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Flibcudacxx\u002Fextended_api\u002Fmdspan\u002Fhost_device_accessor.html).\r\n- New pointer utilities: [is_aligned](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Flibcudacxx\u002Fextended_api\u002Fmemory\u002Fis_aligned.html), [
align_up](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Flibcudacxx\u002Fextended_api\u002Fmemory\u002Falign_up.html), [align_down](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Flibcudacxx\u002Fextended_api\u002Fmemory\u002Falign_down.html), [ptr_rebind](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Flibcudacxx\u002Fextended_api\u002Fmemory\u002Fptr_rebind.html).\r\n- New math utilities: [ceil_ilog2](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Flibcudacxx\u002Fextended_api\u002Fmath\u002Filog.html), [power-of-two helpers](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Flibcudacxx\u002Fextended_api\u002Fmath.html), [fast_mod_div](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Flibcudacxx\u002Fextended_api\u002Fmath\u002Ffast_mod_div.html).\r\n- New PT","2025-10-14T22:04:54",{"id":272,"version":273,"summary_zh":274,"released_at":275},101113,"python-0.3.1","These are the release notes for the `cuda-cccl` Python package version 0.3.1, dated **October 8th, 2025**. The previous release was v0.3.0.\r\n\r\n`cuda.cccl` is in \"experimental\" status, meaning that its API and feature set can change quite rapidly.\r\n\r\n## Installation\r\n\r\nPlease refer to the install instructions [here](https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fpython\u002Fsetup.html)\r\n\r\n## Features and improvements\r\n\r\n* The `cuda.cccl.parallel.experimental` package has been renamed to `cuda.compute`.\r\n* The `cuda.cccl.cooperative.experimental` package has been renamed to `cuda.coop`.\r\n* The old imports will continue to work for now, but will be removed in a subsequent release.\r\n* Documentation at https:\u002F\u002Fnvidia.github.io\u002Fcccl\u002Fpython\u002F has been updated to reflect these changes.\r\n\r\n## Bug Fixes\r\n\r\n-\r\n\r\n## Breaking Changes\r\n\r\n* If you previously were importing _subpackages_ of `cuda.cccl.parallel.experimental` or `cuda.cccl.cooperative.experimental`, those imports may not work as expected. 
Please import from `cuda.compute` and `cuda.coop` respectively.","2025-10-08T22:05:04",{"id":277,"version":278,"summary_zh":279,"released_at":280},101114,"v3.0.3","\u003C!-- Release notes generated using configuration in .github\u002Frelease.yml at v3.0.3 -->\r\n\r\n## What's Changed\r\n### 🔄 Other Changes\r\n* Backport #5442 to branch\u002F3.0x by @shwina in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F5469\r\n* Backport to 3.0: Fix grid dependency sync in cub::DeviceMergeSort (#5456) by @bernhardmgruber in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F5461\r\n* Partial backport to 3.0: Fix SMEM alignment in DeviceTransform by @bernhardmgruber in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F5463\r\n* [Version] Update branch\u002F3.0.x to v3.0.3 by @github-actions[bot] in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F5502\r\n* [Backport branch\u002F3.0.x] NV_TARGET and cuda::ptx for CTK 13 by @fbusato in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F5481\r\n* [BACKPORT 3.0]: Update PTX ISA version for CUDA 13 (#5676) by @miscco in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F5700\r\n* Backport some MSVC test fixes to 3.0 by @miscco in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F5819\r\n* [Backport 3.0]: Work around `submdspan` compiler issue on MSVC (#5885) by @miscco in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F5903\r\n* Backport pin of llvmlite dependency to branch\u002F3.0x by @shwina in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6000\r\n* [Backport branch\u002F3.0.x] Ensure that we are actually calling the cuda APIs ... 
(#4570) by @davebayer in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6098\r\n* [Backport to 3.0] add a specialization of `__make_tuple_types` for `complex\u003CT>` (#6102) by @davebayer in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6117\r\n* [Backport 3.0.x] Use proper qualification in allocate.h (#4796) by @wmaxey in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fpull\u002F6126\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fcompare\u002Fv3.0.2...v3.0.3","2025-10-07T00:34:37"]