[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-NVIDIA--cutile-python":3,"tool-NVIDIA--cutile-python":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",150037,2,"2026-04-10T23:33:47",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":76,"owner_website":77,"owner_url":78,"languages":79,"stars":100,"forks":101,"last_commit_at":102,"license":103,"difficulty_score":104,"env_os":105,"env_gpu":106,"env_ram":107,"env_deps":108,"category_tags":121,"github_topics":122,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":130,"updated_at":131,"faqs":132,"releases":160},6548,"NVIDIA\u002Fcutile-python","cutile-python","cuTile is a programming model for writing parallel kernels for NVIDIA GPUs","cuTile-python 是英伟达推出的一款专为 NVIDIA GPU 设计的并行计算编程模型，让开发者能够使用熟悉的 Python 语法高效编写高性能 GPU 内核。它主要解决了传统 GPU 开发中 C++\u002FCUDA 门槛高、代码复杂且难以维护的痛点，通过引入基于“瓦片（Tile）”的数据抽象机制，自动处理线程索引与内存加载存储细节，使并行逻辑更加直观清晰。\n\n该工具特别适合需要加速数值计算、深度学习或科学模拟的 Python 开发者及研究人员。用户只需定义数据分块大小，即可像操作普通数组一样进行并行运算，大幅降低了 GPU 并行编程的难度。其核心技术亮点在于基于 Tile IR 生成底层内核，并支持最新的 Blackwell 架构以及 Ampere\u002FAda 架构 GPU。cuTile-python 还能无缝集成 CuPy 
等主流生态库，既保留了 Python 的开发效率，又能充分发挥硬件算力。对于希望在不深入学习底层 CUDA C++ 的前提下挖掘 GPU 潜力的技术团队而言，这是一个兼具易用性与高性能的理想选择。","\u003C!--- SPDX-FileCopyrightText: Copyright (c) \u003C2025> NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->\n\u003C!--- SPDX-License-Identifier: Apache-2.0 -->\n\ncuTile Python\n=============\n\ncuTile Python is a programming language for NVIDIA GPUs. The official documentation can be found\non [docs.nvidia.com](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcutile-python),\nor built from source located in the [docs](docs\u002F) folder.\n\n\nExample\n-------\n```python\n# This examples uses CuPy which can be installed via `pip install cupy-cuda13x`\n# Make sure cuda toolkit 13.1+ is installed: https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-downloads\n\nimport cuda.tile as ct\nimport cupy\nimport numpy as np\n\nTILE_SIZE = 16\n\n# cuTile kernel for adding two dense vectors. It runs in parallel on the GPU.\n@ct.kernel\ndef vector_add_kernel(a, b, result):\n    block_id = ct.bid(0)\n    a_tile = ct.load(a, index=(block_id,), shape=(TILE_SIZE,))\n    b_tile = ct.load(b, index=(block_id,), shape=(TILE_SIZE,))\n    result_tile = a_tile + b_tile\n    ct.store(result, index=(block_id,), tile=result_tile)\n\n# Generate input arrays\nrng = cupy.random.default_rng()\na = rng.random(128)\nb = rng.random(128)\nexpected = cupy.asnumpy(a) + cupy.asnumpy(b)\n\n# Allocate an output array and launch the kernel\nresult = cupy.zeros_like(a)\ngrid = (ct.cdiv(a.shape[0], TILE_SIZE), 1, 1)\nct.launch(cupy.cuda.get_current_stream(), grid, vector_add_kernel, (a, b, result))\n\n# Verify the results\nresult_np = cupy.asnumpy(result)\nnp.testing.assert_array_almost_equal(result_np, expected)\n```\n\nMore examples can be found at [Samples](samples\u002F) and [TileGym](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTileGym).\n\nSystem Requirements\n-------------------\ncuTile Python generates kernels based on [Tile 
IR](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Ftile-ir\u002F)\nwhich requires NVIDIA Driver r580 or later to run.\nFurthermore, the `tileiras` compiler (version 13.2) only supports Blackwell and Ampere\u002FAda\nGPUs. Hopper GPUs will be supported in upcoming versions.\nCheck out the [prerequisites](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcutile-python\u002Fquickstart.html#prerequisites)\nfor the full list of requirements.\n\n\nInstalling from PyPI\n--------------------\ncuTile Python is published on [PyPI](https:\u002F\u002Fpypi.org\u002F) under the\n[cuda-tile](https:\u002F\u002Fpypi.org\u002Fproject\u002Fcuda-tile\u002F) package name and can be installed with `pip`:\n```\npip install cuda-tile[tileiras]\n```\nThe optional `tileiras` dependency installs the `tileiras` compiler directly into your Python\nenvironment.\n\n\nIf you do not want to have `tileiras` inside the Python environment, run\n```\npip install cuda-tile\n```\nand install [CUDA Toolkit 13.1+](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-downloads) separately.\n\nOn a Debian-based system, use `apt-get install cuda-tileiras-13.2\ncuda-compiler-13.2` instead of `apt-get install cuda-toolkit-13.2` if you wish\nto avoid installing the full CUDA Toolkit.\n\n\nBuilding from Source\n--------------------\ncuTile is written mostly in Python, but includes a C++ extension which needs to be built.\nYou will need:\n- A C++17-capable compiler, such as GNU C++ or MSVC;\n- CMake 3.18+;\n- GNU Make on Linux or msbuild on Windows;\n- Python 3.10+ with development headers (`venv` module is recommended but optional);\n- [CUDA Toolkit 13.1+](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-downloads)\n\nOn an Ubuntu system, the first four dependencies can be installed with APT:\n```\nsudo apt-get update && sudo apt-get install build-essential cmake python3-dev python3-venv\n```\n\nThe CMakeLists.txt script will also automatically download\nthe 
[DLPack](https:\u002F\u002Fgithub.com\u002Fdmlc\u002Fdlpack) dependency from GitHub.\nIf you wish to disable this behavior and provide your own copy of DLPack,\nset the `CUDA_TILE_CMAKE_DLPACK_PATH` environment variable to a local path\nto the DLPack source tree.\n\nUnless you are already using a Python virtual environment, it is recommended to create one\nin order to avoid installing cuTile globally:\n\n```\npython3 -m venv env\nsource env\u002Fbin\u002Factivate\n```\n\nOnce the build dependencies are in place, the simplest way to build cuTile is to install it\nin editable mode by running the following command in the source root directory:\n\n```\npip install -e .\n```\n\nThis will create the `build` directory and invoke the CMake-based build process.\nIn editable mode, the compiled extension module will be placed in the build directory,\nand then a symbolic link to it will be created in the source directory.\nThis makes sure that the `pip install -e .` command above is needed only once, and recompiling\nthe extension after making changes to the C++ code can be done with `make -C build`\nwhich is much faster. 
This logic is defined in [setup.py](.\u002Fsetup.py).\n\nExperimental Features (Optional)\n--------------------------------\ncuTile now provides an experimental package containing APIs that are still under active development.\nThese are **not** part of the stable `cuda.tile` API and may change.\n\nTo enable the experimental features when working from a source checkout, install the experimental\npackage from the repository root:\n```\npip install .\u002Fexperimental\n```\n\nYou can also install it directly from a GitHub repository subdirectory:\n```\npip install \\\n  \"git+https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutile-python.git#egg=cuda-tile-experimental&subdirectory=experimental\"\n```\n\nFor example, this will make the experimental namespace available for autotuner:\n```\nfrom cuda.tile_experimental import autotune_launch, clear_autotune_cache\n```\n\nRunning Tests\n-------------\ncuTile uses the [pytest](https:\u002F\u002Fpytest.org) framework for testing.\nTests have extra dependencies, such as PyTorch, which can be installed with\n```\npip install -r test\u002Frequirements.txt\n```\n\nThe tests are located in the [test\u002F](test\u002F) directory. To run a specific test file,\nfor example `test_copy.py`, use the following command:\n```\npytest test\u002Ftest_copy.py\n```\n\nCopyright and License Information\n---------------------------------\nCopyright © 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n\ncuTile-Python is licensed under the Apache 2.0 license. See the [LICENSES](LICENSES\u002F) folder for the full license text.\n","\u003C!--- SPDX-FileCopyrightText: Copyright (c) \u003C2025> NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
-->\n\u003C!--- SPDX-License-Identifier: Apache-2.0 -->\n\ncuTile Python\n=============\n\ncuTile Python 是一种面向 NVIDIA GPU 的编程语言。官方文档可在 [docs.nvidia.com](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcutile-python) 上找到，也可以从位于 [docs](docs\u002F) 文件夹中的源代码构建。\n\n示例\n-------\n```python\n# 本示例使用 CuPy，可通过 `pip install cupy-cuda13x` 安装。\n# 请确保已安装 CUDA 工具包 13.1 或更高版本：https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-downloads\n\nimport cuda.tile as ct\nimport cupy\nimport numpy as np\n\nTILE_SIZE = 16\n\n# cuTile 内核，用于对两个密集向量进行加法运算。它在 GPU 上并行执行。\n@ct.kernel\ndef vector_add_kernel(a, b, result):\n    block_id = ct.bid(0)\n    a_tile = ct.load(a, index=(block_id,), shape=(TILE_SIZE,))\n    b_tile = ct.load(b, index=(block_id,), shape=(TILE_SIZE,))\n    result_tile = a_tile + b_tile\n    ct.store(result, index=(block_id,), tile=result_tile)\n\n# 生成输入数组\nrng = cupy.random.default_rng()\na = rng.random(128)\nb = rng.random(128)\nexpected = cupy.asnumpy(a) + cupy.asnumpy(b)\n\n# 分配输出数组并启动内核\nresult = cupy.zeros_like(a)\ngrid = (ct.cdiv(a.shape[0], TILE_SIZE), 1, 1)\nct.launch(cupy.cuda.get_current_stream(), grid, vector_add_kernel, (a, b, result))\n\n# 验证结果\nresult_np = cupy.asnumpy(result)\nnp.testing.assert_array_almost_equal(result_np, expected)\n```\n\n更多示例可在 [Samples](samples\u002F) 和 [TileGym](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTileGym) 中找到。\n\n系统要求\n-------------------\ncuTile Python 基于 [Tile IR](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Ftile-ir\u002F) 生成内核，运行时需要 NVIDIA 驱动程序 r580 或更高版本。此外，`tileiras` 编译器（版本 13.2）仅支持 Blackwell GPU 以及 Ampere\u002FAda 架构的 GPU。Hopper 架构的 GPU 将在后续版本中得到支持。完整的依赖列表请参阅 [先决条件](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcutile-python\u002Fquickstart.html#prerequisites)。\n\n通过 PyPI 安装\n--------------------\ncuTile Python 已发布在 [PyPI](https:\u002F\u002Fpypi.org\u002F) 上，软件包名为 `cuda-tile`，可使用 `pip` 安装：\n```\npip install cuda-tile[tileiras]\n```\n可选的 `tileiras` 依赖会将 `tileiras` 编译器直接安装到您的 Python 环境中。\n\n如果您不希望在 Python 环境中包含 
`tileiras`，可以运行以下命令：\n```\npip install cuda-tile\n```\n并单独安装 [CUDA 工具包 13.1+](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-downloads)。\n\n在基于 Debian 的系统上，若希望避免安装完整的 CUDA 工具包，可使用 `apt-get install cuda-tileiras-13.2 cuda-compiler-13.2` 替代 `apt-get install cuda-toolkit-13.2`。\n\n从源代码构建\n--------------------\ncuTile 主要以 Python 编写，但也包含一个需要编译的 C++ 扩展模块。您需要：\n- 具备 C++17 支持的编译器，例如 GNU C++ 或 MSVC；\n- CMake 3.18 或更高版本；\n- Linux 上的 GNU Make 或 Windows 上的 msbuild；\n- Python 3.10 或更高版本，并带有开发头文件（推荐使用 `venv` 模块，但非强制）；\n- [CUDA 工具包 13.1+](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-downloads)。\n\n在 Ubuntu 系统上，前四个依赖项可以通过 APT 安装：\n```\nsudo apt-get update && sudo apt-get install build-essential cmake python3-dev python3-venv\n```\n\nCMakeLists.txt 脚本还会自动从 GitHub 下载 [DLPack](https:\u002F\u002Fgithub.com\u002Fdmlc\u002Fdlpack) 依赖。如果您希望禁用此行为并提供自己的 DLPack 副本，请将 `CUDA_TILE_CMAKE_DLPACK_PATH` 环境变量设置为本地 DLPack 源码树的路径。\n\n除非您已经使用 Python 虚拟环境，否则建议创建一个虚拟环境，以避免全局安装 cuTile：\n```\npython3 -m venv env\nsource env\u002Fbin\u002Factivate\n```\n\n一旦构建依赖项就绪，构建 cuTile 的最简单方式是在源代码根目录下以可编辑模式安装：\n```\npip install -e .\n```\n\n这将会创建 `build` 目录，并调用基于 CMake 的构建流程。在可编辑模式下，编译后的扩展模块会被放置在 `build` 目录中，随后会在源代码目录中为其创建一个符号链接。这样可以确保只需执行一次上述 `pip install -e .` 命令，而在对 C++ 代码进行修改后，只需运行 `make -C build` 即可重新编译扩展模块，速度更快。这一逻辑定义在 [setup.py](.\u002Fsetup.py) 中。\n\n实验性功能（可选）\n--------------------------------\ncuTile 现在提供了一个实验性软件包，其中包含仍在积极开发中的 API。这些 API **不属于**稳定的 `cuda.tile` API，可能会发生变化。\n\n要在从源代码检出的工作环境中启用实验性功能，可以从仓库根目录安装实验性软件包：\n```\npip install .\u002Fexperimental\n```\n\n您也可以直接从 GitHub 仓库的子目录安装：\n```\npip install \\\n  \"git+https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutile-python.git#egg=cuda-tile-experimental&subdirectory=experimental\"\n```\n\n例如，这将使实验性命名空间可用于自动调优器：\n```\nfrom cuda.tile_experimental import autotune_launch, clear_autotune_cache\n```\n\n运行测试\n-------------\ncuTile 使用 [pytest](https:\u002F\u002Fpytest.org) 框架进行测试。测试有额外的依赖项，例如 PyTorch，可通过以下命令安装：\n```\npip install -r 
test\u002Frequirements.txt\n```\n\n测试文件位于 [test\u002F](test\u002F) 目录中。要运行特定的测试文件，例如 `test_copy.py`，可以使用以下命令：\n```\npytest test\u002Ftest_copy.py\n```\n\n版权与许可信息\n---------------------------------\n版权所有 © 2025 NVIDIA CORPORATION & AFFILIATES。保留所有权利。\n\ncuTile-Python 采用 Apache 2.0 许可证授权。完整许可证文本请参阅 [LICENSES](LICENSES\u002F) 文件夹。","# cuTile-Python 快速上手指南\n\ncuTile-Python 是专为 NVIDIA GPU 设计的编程语言，基于 Tile IR 生成高性能内核。本指南将帮助您快速配置环境并运行第一个示例。\n\n## 环境准备\n\n在开始之前，请确保您的系统满足以下硬件和软件要求：\n\n*   **GPU 架构支持**：\n    *   目前支持 **Blackwell** 和 **Ampere\u002FAda** 架构的 GPU。\n    *   Hopper 架构将在后续版本中支持。\n*   **驱动程序**：必须安装 **NVIDIA Driver r580** 或更高版本。\n*   **CUDA Toolkit**：需要 **CUDA Toolkit 13.1+**。\n    *   下载地址：[NVIDIA CUDA Downloads](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-downloads)\n    *   *Debian\u002FUbuntu 用户提示*：若不想安装完整 toolkit，可单独安装编译器组件：`apt-get install cuda-tileiras-13.2 cuda-compiler-13.2`。\n*   **Python 版本**：Python 3.10 或更高版本。\n*   **构建依赖**（仅源码编译时需要）：\n    *   C++17 编译器 (GCC 或 MSVC)\n    *   CMake 3.18+\n    *   Python 开发头文件 (`python3-dev`)\n\n## 安装步骤\n\n推荐使用 `pip` 从 PyPI 安装。您可以根据需求选择是否将编译器包含在 Python 环境中。\n\n### 方式一：标准安装（推荐）\n此方式会将 `tileiras` 编译器直接安装到当前 Python 环境中，配置最简单。\n\n```bash\npip install cuda-tile[tileiras]\n```\n\n### 方式二：最小化安装\n如果您已在全局路径配置了 CUDA Toolkit，或希望保持 Python 环境纯净，可选择此方式。\n\n```bash\npip install cuda-tile\n```\n*注意：使用此方式前，请确保系统已正确安装 CUDA Toolkit 13.1+ 且环境变量已配置。*\n\n### 国内加速建议\n在中国大陆地区，建议使用国内镜像源加速安装：\n\n```bash\npip install cuda-tile[tileiras] -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n## 基本使用\n\n以下是一个最简单的示例，演示如何使用 cuTile 编写一个并行向量加法内核。\n\n**前置依赖**：本示例依赖 `cupy`，请先安装：\n```bash\npip install cupy-cuda13x\n```\n\n**代码示例** (`example.py`)：\n\n```python\n# This examples uses CuPy which can be installed via `pip install cupy-cuda13x`\n# Make sure cuda toolkit 13.1+ is installed: https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-downloads\n\nimport cuda.tile as ct\nimport cupy\nimport numpy as np\n\nTILE_SIZE = 16\n\n# cuTile kernel for 
adding two dense vectors. It runs in parallel on the GPU.\n@ct.kernel\ndef vector_add_kernel(a, b, result):\n    block_id = ct.bid(0)\n    a_tile = ct.load(a, index=(block_id,), shape=(TILE_SIZE,))\n    b_tile = ct.load(b, index=(block_id,), shape=(TILE_SIZE,))\n    result_tile = a_tile + b_tile\n    ct.store(result, index=(block_id,), tile=result_tile)\n\n# Generate input arrays\nrng = cupy.random.default_rng()\na = rng.random(128)\nb = rng.random(128)\nexpected = cupy.asnumpy(a) + cupy.asnumpy(b)\n\n# Allocate an output array and launch the kernel\nresult = cupy.zeros_like(a)\ngrid = (ct.cdiv(a.shape[0], TILE_SIZE), 1, 1)\nct.launch(cupy.cuda.get_current_stream(), grid, vector_add_kernel, (a, b, result))\n\n# Verify the results\nresult_np = cupy.asnumpy(result)\nnp.testing.assert_array_almost_equal(result_np, expected)\n\nprint(\"Vector addition completed successfully!\")\n```\n\n**运行测试**：\n确保 GPU 驱动和 CUDA 环境配置无误后，直接运行脚本：\n```bash\npython example.py\n```\n如果未抛出异常且输出成功信息，则表示环境配置正确，cuTile 内核已成功执行。","某高性能计算团队正在为新一代 Blackwell 架构 GPU 开发定制化的大规模矩阵运算内核，以加速深度学习模型的训练过程。\n\n### 没有 cutile-python 时\n- 开发者必须深入编写繁琐的 CUDA C++ 代码，手动管理线程块（Block）和线程（Thread）的索引映射，极易出错。\n- 针对新的 GPU 架构优化内存访问模式时，需要反复调整底层指令，开发周期长且对硬件细节依赖过重。\n- 算法原型验证困难，每次修改逻辑都需要重新编译整个 C++ 项目，无法利用 Python 生态快速迭代。\n- 团队中擅长算法但不懂底层 GPU 架构的科研人员难以直接参与内核优化，协作效率低下。\n\n### 使用 cutile-python 后\n- 研究人员可直接用纯 Python 语法定义并行内核，通过 `@ct.kernel` 装饰器和直观的 `load\u002Fstore` 接口自动处理线程调度。\n- 借助基于 Tile IR 的编程模型，轻松实现数据分块（Tiling）逻辑，天然适配 Blackwell 架构的高带宽内存特性。\n- 能够结合 CuPy 等现有 Python 库即时运行和调试内核，将算法从构思到验证的时间从数天缩短至几分钟。\n- 降低了 GPU 编程门槛，让算法专家能直接编写高性能算子，无需等待底层工程师翻译需求，显著提升协作流畅度。\n\ncutile-python 通过将复杂的 GPU 并行细节抽象为简洁的 Python 接口，让开发者能专注于算法逻辑本身，从而在新一代 NVIDIA GPU 上高效释放极致算力。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVIDIA_cutile-python_694d2feb.png","NVIDIA","NVIDIA 
Corporation","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FNVIDIA_7dcf6000.png","",null,"https:\u002F\u002Fnvidia.com","https:\u002F\u002Fgithub.com\u002FNVIDIA",[80,84,88,92,96],{"name":81,"color":82,"percentage":83},"Python","#3572A5",91.3,{"name":85,"color":86,"percentage":87},"C++","#f34b7d",7.9,{"name":89,"color":90,"percentage":91},"CMake","#DA3434",0.5,{"name":93,"color":94,"percentage":95},"C","#555555",0.2,{"name":97,"color":98,"percentage":99},"Shell","#89e051",0.1,2014,130,"2026-04-10T16:26:19","NOASSERTION",4,"Linux, Windows","必需 NVIDIA GPU。支持 Blackwell 和 Ampere\u002FAda 架构（Hopper 将在未来版本支持）。需安装 NVIDIA 驱动 r580 或更高版本，以及 CUDA Toolkit 13.1+（编译器 tileiras 版本为 13.2）。未明确说明具体显存大小要求。","未说明",{"notes":109,"python":110,"dependencies":111},"1. 该工具主要用于生成基于 Tile IR 的 GPU 内核，必须使用较新的 NVIDIA 驱动 (r580+) 和特定架构的显卡 (Blackwell, Ampere, Ada)。\n2. 可通过 pip 安装预编译包，也可从源码构建（需要 C++ 编译环境和 CMake）。\n3. 若在 Debian 系统上不想安装完整的 CUDA Toolkit，可单独安装 cuda-tileiras-13.2 和 cuda-compiler-13.2。\n4. 包含实验性功能包 (cuda-tile-experimental)，需单独安装。\n5. 运行测试需要额外安装 PyTorch 等依赖。","3.10+",[112,113,114,115,116,117,118,119,120],"cuda-tile","tileiras (可选，版本 13.2)","cupy (示例依赖，需匹配 CUDA 版本如 cupy-cuda13x)","numpy","CMake 3.18+","C++17 编译器 (GNU C++ 或 MSVC)","DLPack (自动下载或手动提供)","pytest (测试依赖)","PyTorch (测试依赖)",[14],[123,124,125,126,127,128,129],"gpu","kernel","python","tile","cutile","tile-based-programming","parallel-kernels","2026-03-27T02:49:30.150509","2026-04-11T17:37:45.095541",[133,138,143,148,152,156],{"id":134,"question_zh":135,"answer_zh":136,"source_url":137},29586,"在 CUDA 13.1 容器中运行 cuTile 时遇到 'PTX JIT compiler library not found' 错误如何解决？","这是因为缺少兼容性库。请执行以下步骤：\n1. 安装 `cuda-compat-13-1` 包。\n2. 
将 `\u002Fusr\u002Flocal\u002Fcuda-13.1\u002Fcompat` 添加到 `LD_LIBRARY_PATH` 环境变量中。\n\n推荐的 Dockerfile 配置片段如下：\n```dockerfile\nENV LD_LIBRARY_PATH=\u002Fusr\u002Flocal\u002Fcuda-13.1\u002Fcompat:${CUDA_HOME}\u002Flib64:\u002Fusr\u002Flocal\u002Fnvidia\u002Flib:\u002Fusr\u002Flocal\u002Fnvidia\u002Flib64:${LD_LIBRARY_PATH}\n```\n此外，确保不需要单独安装 `nvidia-cuda-tileiras`，通常只需安装 `cuda-tile` 和 `cupy-cuda13x` 即可。","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutile-python\u002Fissues\u002F44",{"id":139,"question_zh":140,"answer_zh":141,"source_url":142},29587,"cuTile 是否支持数组的索引和切片操作？其语义是返回副本还是视图？","是的，从 v1.1 版本开始支持数组切片。\n关于语义：cuTile 的核心原则是 \"tile\" 始终是一个值（value）。如果操作接收一个 tile 并返回一个 tile，结果是值的副本；如果操作从数组加载 tile，结果也是数组值的副本。\n但是，Array 对象本身代表内存引用。因此，返回 Array 的操作（如切片）遵循视图（view）语义，除非明确说明是复制。这意味着通过切片获得的子数组是原数组的视图，而非数据副本。","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutile-python\u002Fissues\u002F34",{"id":144,"question_zh":145,"answer_zh":146,"source_url":147},29588,"在 Windows 上运行 cuTile 示例代码时出现 'Error loading nvrtc-builtins64_131.dll' 或 NVRTC 编译错误怎么办？","该问题通常与环境配置有关。注意以下几点：\n1. 如果您使用的是 Conda 环境，Conda 设计为自包含的用户空间包管理器，它会忽略系统已安装的包，因此必须安装所有必要的 NVIDIA CUDA Conda 依赖项以保持一致性。\n2. 确保您的环境中正确安装了与 CUDA 版本匹配的 NVRTC 组件。\n3. 由于 Windows 上原有的 `print_env.sh` 脚本无法工作，建议使用社区提供的 PowerShell 脚本来诊断环境问题（参考相关 PR #64）。\n4. 
尝试重启开发工作站有时也能解决临时的 DLL 加载问题。","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutile-python\u002Fissues\u002F57",{"id":149,"question_zh":150,"answer_zh":151,"source_url":142},29589,"在使用 cuTile 编写 GEMM 程序时，如何在 device 层级根据 block 索引动态切分矩阵？","在 v1.1 之前，cuTile 不支持非常量切片（Non-constant slices），导致无法直接使用变量进行索引（如 `x[start:end]`）。\n现在（v1.1+），您可以直接使用切片语法来获取数组的视图。例如：\n```python\nlocal_x = x[block_idx_x * tm : block_idx_x * tm + tm, :]\nlocal_y = y[block_idx_y * tn : block_idx_y * tn + tn, :]\n```\n这将创建原数组的视图（view），允许您将特定的数据块传递给后续的 kernel 函数进行处理，而无需手动传递整个大数组并在内部计算偏移。",{"id":153,"question_zh":154,"answer_zh":155,"source_url":137},29590,"构建 cuTile Docker 容器时，推荐的基础镜像和环境变量配置是什么？","推荐使用 `nvidia\u002Fcuda:13.1.0-devel-ubuntu22.04` 作为基础镜像。\n关键配置步骤包括：\n1. 设置 `CUDA_HOME=\u002Fusr\u002Flocal\u002Fcuda`。\n2. 更新 `PATH` 包含 `${CUDA_HOME}\u002Fbin`。\n3. 正确设置 `LD_LIBRARY_PATH`，务必包含兼容层路径（如果是 CUDA 13.1）：\n   `ENV LD_LIBRARY_PATH=\u002Fusr\u002Flocal\u002Fcuda-13.1\u002Fcompat:${CUDA_HOME}\u002Flib64:\u002Fusr\u002Flocal\u002Fnvidia\u002Flib:\u002Fusr\u002Flocal\u002Fnvidia\u002Flib64:${LD_LIBRARY_PATH}`\n4. 安装依赖时，通常只需 `pip install cuda-tile cupy-cuda13x`，避免多余安装 `nvidia-cuda-tileiras` 以防冲突。",{"id":157,"question_zh":158,"answer_zh":159,"source_url":142},29591,"cuTile 中 'tile' 和 'array' 的数据语义有什么区别？","两者的核心区别在于值语义与引用语义：\n- **Tile**: 始终代表一个具体的“值”。任何涉及 tile 的计算或加载操作（如 `ct.load`）都会产生数据的副本。这意味着修改一个 tile 不会影响原始数据源。\n- **Array**: 代表对内存的“引用”。对 Array 进行的切片或子区域选择操作默认返回的是“视图”（view），即指向同一块内存的不同窗口，而不是复制数据。\n理解这一点对于优化内存使用和避免意外的数据拷贝至关重要。",[]]