[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-NVlabs--tiny-cuda-nn":3,"tool-NVlabs--tiny-cuda-nn":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",149489,2,"2026-04-10T11:32:46",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":76,"owner_website":77,"owner_url":78,"languages":79,"stars":104,"forks":105,"last_commit_at":106,"license":107,"difficulty_score":108,"env_os":109,"env_gpu":110,"env_ram":111,"env_deps":112,"category_tags":122,"github_topics":123,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":133,"updated_at":134,"faqs":135,"releases":168},6232,"NVlabs\u002Ftiny-cuda-nn","tiny-cuda-nn","Lightning fast C++\u002FCUDA neural network framework","tiny-cuda-nn 是一个轻量级且自包含的 C++\u002FCUDA 神经网络框架，专为在 NVIDIA GPU 上实现极速训练与推理而设计。它主要解决了传统深度学习框架在处理特定结构网络时效率不足的问题，通过高度优化的底层代码，显著提升了运算速度。\n\n该工具特别适合需要高性能计算的开发者、图形学研究人员以及从事神经渲染（如 Instant NGP）工作的工程师。如果你希望在资源受限的环境下快速原型验证，或追求极致的实时推理性能，tiny-cuda-nn 是理想选择。\n\n其核心技术亮点在于“完全融合”的多层感知机（Fully Fused MLP）和多分辨率哈希编码（Multiresolution Hash Encoding）。前者通过将多个神经网络层操作合并为单个 CUDA 内核，大幅减少内存访问开销；后者则能高效表示高频细节，常用于三维场景重建。此外，框架还支持即时编译融合（JIT 
fusion）技术，在较新的 GPU 上可进一步带来 1.5 至 2.5 倍的性能提升。整体而言，tiny-cuda-nn 以简洁的 API 和卓越的性能，成为加速小型神经网络应用的有力工具。","# Tiny CUDA Neural Networks ![](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Ftiny-cuda-nn\u002Fworkflows\u002FCI\u002Fbadge.svg)\n\nThis is a small, self-contained framework for training and querying neural networks. Most notably, it contains a lightning fast [\"fully fused\" multi-layer perceptron](https:\u002F\u002Fraw.githubusercontent.com\u002FNVlabs\u002Ftiny-cuda-nn\u002Fmaster\u002Fdata\u002Freadme\u002Ffully-fused-mlp-diagram.png) ([technical paper](https:\u002F\u002Ftom94.net\u002Fdata\u002Fpublications\u002Fmueller21realtime\u002Fmueller21realtime.pdf)), a versatile [multiresolution hash encoding](https:\u002F\u002Fraw.githubusercontent.com\u002FNVlabs\u002Ftiny-cuda-nn\u002Fmaster\u002Fdata\u002Freadme\u002Fmultiresolution-hash-encoding-diagram.png) ([technical paper](https:\u002F\u002Fnvlabs.github.io\u002Finstant-ngp\u002Fassets\u002Fmueller2022instant.pdf)), as well as support for various other input encodings, losses, and optimizers.\n\n## Performance\n\n![Image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVlabs_tiny-cuda-nn_readme_feb6d888d04f.png)\n_Fully fused networks vs. TensorFlow v2.5.0 w\u002F XLA. Measured on 64 (solid line) and 128 (dashed line) neurons wide multi-layer perceptrons on an RTX 3090. 
Generated by `benchmarks\u002Fbench_ours.cu` and `benchmarks\u002Fbench_tensorflow.py` using `data\u002Fconfig_oneblob.json`._\n\n\n## Usage\n\nTiny CUDA neural networks have a simple C++\u002FCUDA API:\n\n```cpp\n#include \u003Ctiny-cuda-nn\u002Fcommon.h>\n\n\u002F\u002F Configure the model\nnlohmann::json config = {\n\t{\"loss\", {\n\t\t{\"otype\", \"L2\"}\n\t}},\n\t{\"optimizer\", {\n\t\t{\"otype\", \"Adam\"},\n\t\t{\"learning_rate\", 1e-3},\n\t}},\n\t{\"encoding\", {\n\t\t{\"otype\", \"HashGrid\"},\n\t\t{\"n_levels\", 16},\n\t\t{\"n_features_per_level\", 2},\n\t\t{\"log2_hashmap_size\", 19},\n\t\t{\"base_resolution\", 16},\n\t\t{\"per_level_scale\", 2.0},\n\t}},\n\t{\"network\", {\n\t\t{\"otype\", \"FullyFusedMLP\"},\n\t\t{\"activation\", \"ReLU\"},\n\t\t{\"output_activation\", \"None\"},\n\t\t{\"n_neurons\", 64},\n\t\t{\"n_hidden_layers\", 2},\n\t}},\n};\n\nusing namespace tcnn;\n\nauto model = create_from_config(n_input_dims, n_output_dims, config);\nmodel->set_jit_fusion(supports_jit_fusion()); \u002F\u002F Optional: accelerate with JIT fusion\n\n\u002F\u002F Train the model (batch_size must be a multiple of tcnn::BATCH_SIZE_GRANULARITY)\nGPUMatrix\u003Cfloat> training_batch_inputs(n_input_dims, batch_size);\nGPUMatrix\u003Cfloat> training_batch_targets(n_output_dims, batch_size);\n\nfor (int i = 0; i \u003C n_training_steps; ++i) {\n\tgenerate_training_batch(&training_batch_inputs, &training_batch_targets); \u002F\u002F \u003C-- your code\n\n\tfloat loss;\n\tmodel.trainer->training_step(training_batch_inputs, training_batch_targets, &loss);\n\tstd::cout \u003C\u003C \"iteration=\" \u003C\u003C i \u003C\u003C \" loss=\" \u003C\u003C loss \u003C\u003C std::endl;\n}\n\n\u002F\u002F Use the model\nGPUMatrix\u003Cfloat> inference_inputs(n_input_dims, batch_size);\ngenerate_inputs(&inference_inputs); \u002F\u002F \u003C-- your code\n\nGPUMatrix\u003Cfloat> inference_outputs(n_output_dims, batch_size);\nmodel.network->inference(inference_inputs, 
inference_outputs);\n```\n\n## JIT fusion\n\nJIT fusion is an optional feature introduced in tiny-cuda-nn v2.0.\nIt is *almost always* recommended to enable [automatic JIT fusion](#automatic-jit-fusion) for a performance boost of 1.5x to 2.5x, depending on the model and GPU.\nNewer GPUs exhibit larger speedups.\n\nIf your model has very large hash grids (~20 million+ parameters) or MLPs (layer sizes larger than 128 neurons), or when your GPU is an RTX 3000 series or earlier, JIT fusion *can* slow down training.\nIn rare cases, it can also slow down inference.\nIn this case, it is recommended to try enabling JIT fusion separately for training and inference to measure whether it is faster.\n\nPlease [open an issue](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Ftiny-cuda-nn\u002Fissues) if you encounter a slowdown in a different situation or other problems with JIT fusion enabled.\n\n### Automatic JIT fusion\n\nTo enable JIT fusion, set the `jit_fusion` property of your model to `true`.\nAll future uses of the model, whether inference or training, will then use JIT mode.\nNote that if there is an error during JIT compilation, a warning is emitted and JIT mode is automatically turned off.\nYour code will still run using the tiny-cuda-nn 1.X code path.\n\n```cpp\nauto model = tcnn::create_from_config(...);\nmodel->set_jit_fusion(tcnn::supports_jit_fusion()); \u002F\u002F Enable JIT if the system supports it\n```\n\nJIT fusion can also be enabled via the PyTorch bindings, but the speed-up will be lower, particularly during training.\nThis is because the JIT compiler does not have access to the whole compute graph and can therefore fuse and optimize less.\n\n```python\nimport tinycudann as tcnn\n\nmodel = tcnn.NetworkWithInputEncoding(...) 
# Or any other tcnn model\nmodel.jit_fusion = tcnn.supports_jit_fusion() # Enable JIT if the system supports it\n```\n\n### Manual JIT fusion\n\nEven larger speed-ups are possible when applications integrate more tightly with JIT fusion.\nFor example, [Instant NGP](https:\u002F\u002Fgithub.com\u002Fnvlabs\u002Finstant-ngp) achieves a 5x speedup by fusing the entire NeRF ray marcher into a single kernel.\n\nJIT fusion works by converting a given tiny-cuda-nn model to a CUDA device function and then compiling it into a kernel using CUDA's runtime compilation (RTC) feature.\n\nTo integrate a tiny-cuda-nn model with a larger kernel in your app, you need to\n1. turn your kernel into a string,\n2. prepend the tiny-cuda-nn model's device function,\n3. pass the result to tiny-cuda-nn's runtime compilation API.\n\nHere is an example that implements a minimal kernel using a tiny-cuda-nn model with 32 input dimensions and 16 output dimensions:\n```cpp\n#include \u003Ctiny-cuda-nn\u002Frtc_kernel.h>\n\nauto model = tcnn::create_from_config(32 \u002F* input dims *\u002F, 16 \u002F* output dims *\u002F, ...);\nauto fused_kernel = tcnn::CudaRtcKernel(\n    \"your_kernel\",\n    fmt::format(R\"\n        {MODEL_DEVICE_FUNCTION}\n        __global__ void your_kernel(...) {\n            \u002F\u002F Get input to model from either registers or memory.\n            tcnn::hvec\u003C32> input = ...;\n            \u002F\u002F Call tiny-cuda-nn model. 
All 32 threads of the warp must be active here.\n            tcnn::hvec\u003C16> output = model_fun(input, params);\n            \u002F\u002F Do something with the model output.\n        }\",\n        fmt::arg(\"MODEL_DEVICE_FUNCTION\", model->generate_device_function(\"model_fun\"))\n    )\n);\n\nuint32_t blocks = 1;\nuint32_t threads = 128; \u002F\u002F Must be a multiple of 32 for neural networks to work.\nuint32_t shmem_size = 0; \u002F\u002F Can be any size that your_kernel needs.\ncudaStream_t stream = nullptr; \u002F\u002F Can be any stream.\nfused_kernel.launch(blocks, threads, shmem_size, stream, ... \u002F* params of your_kernel *\u002F);\n```\n\nAnd here is Instant NGP's NeRF integration with the JIT compiler for reference:\n- [src\u002Ftestbed_nerf.cu](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Finstant-ngp\u002Fblob\u002Fd6bbefb0b68e6322711b518eac7f9ab4c1cc7b1e\u002Fsrc\u002Ftestbed_nerf.cu#L1931)\n- [include\u002Fneural-graphics-primitives\u002Ffused_kernels\u002Frender_nerf.cuh](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Finstant-ngp\u002Fblob\u002Fmaster\u002Finclude\u002Fneural-graphics-primitives\u002Ffused_kernels\u002Frender_nerf.cuh)\n\n\n## Example: learning a 2D image\n\nWe provide a sample application where an image function _(x,y) -> (R,G,B)_ is learned. It can be run via\n```sh\ntiny-cuda-nn$ .\u002Fbuild\u002Fmlp_learning_an_image https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVlabs_tiny-cuda-nn_readme_974957d202f6.jpg data\u002Fconfig_hash.json\n```\nproducing an image every couple of training steps. 
Each 1000 steps should take a bit over 1 second with the default configuration on an RTX 4090.\n\n| 10 steps | 100 steps | 1000 steps | Reference image |\n|:---:|:---:|:---:|:---:|\n| ![10steps](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVlabs_tiny-cuda-nn_readme_f01b08571aad.jpg) | ![100steps](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVlabs_tiny-cuda-nn_readme_4147a14635a6.jpg) | ![1000steps](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVlabs_tiny-cuda-nn_readme_819f9f9d1bc2.jpg) | ![reference](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVlabs_tiny-cuda-nn_readme_974957d202f6.jpg) |\n\n\n\n## Requirements\n\n- An __NVIDIA GPU__; tensor cores increase performance when available. All shown results come from an RTX 3090.\n- A __C++17__ capable compiler. The following choices are recommended and have been tested:\n  - __Windows:__ Visual Studio 2019 or 2022\n  - __Linux:__ GCC\u002FG++ 8 or higher\n- A recent version of __[CUDA](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-toolkit)__. The following choices are recommended and have been tested:\n  - __Windows:__ CUDA 11.5 or higher\n  - __Linux:__ CUDA 10.2 or higher\n- __[CMake](https:\u002F\u002Fcmake.org\u002F) v3.21 or higher__.\n- The fully fused MLP component of this framework requires a __very large__ amount of shared memory in its default configuration. It will likely only work on an RTX 3090, an RTX 2080 Ti, or higher-end GPUs. 
Lower end cards must reduce the `n_neurons` parameter or use the `CutlassMLP` (better compatibility but slower) instead.\n\nIf you are using Linux, install the following packages\n```sh\nsudo apt-get install build-essential git\n```\n\nWe also recommend installing [CUDA](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-toolkit) in `\u002Fusr\u002Flocal\u002F` and adding the CUDA installation to your PATH.\nFor example, if you have CUDA 12.6.3, add the following to your `~\u002F.bashrc`\n```sh\nexport PATH=\"\u002Fusr\u002Flocal\u002Fcuda-12.6.3\u002Fbin:$PATH\"\nexport LD_LIBRARY_PATH=\"\u002Fusr\u002Flocal\u002Fcuda-12.6.3\u002Flib64:$LD_LIBRARY_PATH\"\n```\n\n\n## Compilation (Windows & Linux)\n\nBegin by cloning this repository and all its submodules using the following command:\n```sh\n$ git clone --recursive https:\u002F\u002Fgithub.com\u002Fnvlabs\u002Ftiny-cuda-nn\n$ cd tiny-cuda-nn\n```\n\nThen, use CMake to build the project: (on Windows, this must be in a [developer command prompt](https:\u002F\u002Fdocs.microsoft.com\u002Fen-us\u002Fcpp\u002Fbuild\u002Fbuilding-on-the-command-line?view=msvc-160#developer_command_prompt))\n```sh\ntiny-cuda-nn$ cmake . -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo\ntiny-cuda-nn$ cmake --build build --config RelWithDebInfo -j\n```\n\nIf compilation fails inexplicably or takes longer than an hour, you might be running out of memory. 
Try running the above command without `-j` in that case.\n\n\n## PyTorch extension\n\n__tiny-cuda-nn__ comes with a [PyTorch](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch) extension that allows using the fast MLPs and input encodings from within a [Python](https:\u002F\u002Fwww.python.org\u002F) context.\nThese bindings can be significantly faster than full Python implementations; in particular for the [multiresolution hash encoding](https:\u002F\u002Fraw.githubusercontent.com\u002FNVlabs\u002Ftiny-cuda-nn\u002Fmaster\u002Fdata\u002Freadme\u002Fmultiresolution-hash-encoding-diagram.png).\n\n> The overheads of Python\u002FPyTorch can nonetheless be extensive if the batch size is small.\n> For example, with a batch size of 64k, the bundled `mlp_learning_an_image` example is __~2x slower__ through PyTorch than native CUDA.\n> With a batch size of 256k and higher (default), the performance is much closer.\n\nBegin by setting up a Python 3.X environment with a recent, CUDA-enabled version of PyTorch. Then, invoke\n```sh\npip install git+https:\u002F\u002Fgithub.com\u002FNVlabs\u002Ftiny-cuda-nn\u002F#subdirectory=bindings\u002Ftorch\n```\n\nAlternatively, if you would like to install from a local clone of __tiny-cuda-nn__, invoke\n```sh\ntiny-cuda-nn$ cd bindings\u002Ftorch\ntiny-cuda-nn\u002Fbindings\u002Ftorch$ python setup.py install\n```\n\nBy default, the extension automatically enables half precision (FP16) on GPUs with good support (Volta, Turing, Ampere, etc.) 
and disables it on older architectures or those with slow FP16 (e.g., Pascal\u002FGTX 10-series).\n\nIf you wish to override this behavior (e.g., to force FP16 on unsupported hardware or disable it for debugging), set the TCNN_HALF_PRECISION environment variable before installation:\n\nDisable FP16: 0\nEnable FP16: 1\n\nExample:\n```sh\n# Linux \u002F macOS (Disable FP16)\nexport TCNN_HALF_PRECISION=0\npip install git+https:\u002F\u002Fgithub.com\u002FNVlabs\u002Ftiny-cuda-nn\u002F#subdirectory=bindings\u002Ftorch\n```\n\nUpon success, you can use __tiny-cuda-nn__ models as in the following example:\n```py\nimport commentjson as json\nimport tinycudann as tcnn\nimport torch\n\nwith open(\"data\u002Fconfig_hash.json\") as f:\n\tconfig = json.load(f)\n\n# Option 1: efficient Encoding+Network combo.\nmodel = tcnn.NetworkWithInputEncoding(\n\tn_input_dims, n_output_dims,\n\tconfig[\"encoding\"], config[\"network\"]\n)\n\n# Option 2: separate modules. Slower but more flexible.\nencoding = tcnn.Encoding(n_input_dims, config[\"encoding\"])\nnetwork = tcnn.Network(encoding.n_output_dims, n_output_dims, config[\"network\"])\nmodel = torch.nn.Sequential(encoding, network)\n\nmodel.jit_fusion = tcnn.supports_jit_fusion() # Optional: accelerate with JIT fusion\n```\n\nSee `samples\u002Fmlp_learning_an_image_pytorch.py` for an example.\n\n\n\n## Components\n\nFollowing is a summary of the components of this framework. [The JSON documentation](DOCUMENTATION.md) lists configuration options.\n\n\n| Networks | &nbsp; | &nbsp;\n| :--- | :---------- | :-----\n| Fully fused MLP | `src\u002Ffully_fused_mlp.cu` | Lightning fast implementation of small multi-layer perceptrons (MLPs).\n| CUTLASS MLP     | `src\u002Fcutlass_mlp.cu`     | MLP based on [CUTLASS](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass)' GEMM routines. 
Slower than fully-fused, but handles larger networks and still is reasonably fast.\n\n| Input encodings | &nbsp; | &nbsp;\n| :--- | :---------- | :-----\n| Composite | `include\u002Ftiny-cuda-nn\u002Fencodings\u002Fcomposite.h` | Allows composing multiple encodings. Can be, for example, used to assemble the Neural Radiance Caching encoding [[Müller et al. 2021]](https:\u002F\u002Ftom94.net\u002F).\n| Frequency | `include\u002Ftiny-cuda-nn\u002Fencodings\u002Ffrequency.h` | NeRF's [[Mildenhall et al. 2020]](https:\u002F\u002Fwww.matthewtancik.com\u002Fnerf) positional encoding applied equally to all dimensions.\n| Grid | `include\u002Ftiny-cuda-nn\u002Fencodings\u002Fgrid.h` | Encoding based on trainable multiresolution grids. Used for [Instant Neural Graphics Primitives [Müller et al. 2022]](https:\u002F\u002Fnvlabs.github.io\u002Finstant-ngp\u002F). The grids can be backed by hashtables, dense storage, or tiled storage.\n| Identity | `include\u002Ftiny-cuda-nn\u002Fencodings\u002Fidentity.h` | Leaves values untouched.\n| Oneblob | `include\u002Ftiny-cuda-nn\u002Fencodings\u002Foneblob.h` | From Neural Importance Sampling [[Müller et al. 2019]](https:\u002F\u002Ftom94.net\u002Fdata\u002Fpublications\u002Fmueller18neural\u002Fmueller18neural-v4.pdf) and Neural Control Variates [[Müller et al. 2020]](https:\u002F\u002Ftom94.net\u002Fdata\u002Fpublications\u002Fmueller20neural\u002Fmueller20neural.pdf).\n| SphericalHarmonics | `include\u002Ftiny-cuda-nn\u002Fencodings\u002Fspherical_harmonics.h` | A frequency-space encoding that is more suitable to direction vectors than component-wise ones.\n| TriangleWave | `include\u002Ftiny-cuda-nn\u002Fencodings\u002Ftriangle_wave.h` | Low-cost alternative to the NeRF's encoding. Used in Neural Radiance Caching [[Müller et al. 
2021]](https:\u002F\u002Ftom94.net\u002F).\n\n| Losses | &nbsp; | &nbsp;\n| :--- | :---------- | :-----\n| L1 | `include\u002Ftiny-cuda-nn\u002Flosses\u002Fl1.h` | Standard L1 loss.\n| Relative L1 | `include\u002Ftiny-cuda-nn\u002Flosses\u002Frelative_l1.h` | Relative L1 loss normalized by the network prediction.\n| MAPE | `include\u002Ftiny-cuda-nn\u002Flosses\u002Fmape.h` | Mean absolute percentage error (MAPE). The same as Relative L1, but normalized by the target.\n| SMAPE | `include\u002Ftiny-cuda-nn\u002Flosses\u002Fsmape.h` | Symmetric mean absolute percentage error (SMAPE). The same as Relative L1, but normalized by the mean of the prediction and the target.\n| L2 | `include\u002Ftiny-cuda-nn\u002Flosses\u002Fl2.h` | Standard L2 loss.\n| Relative L2 | `include\u002Ftiny-cuda-nn\u002Flosses\u002Frelative_l2.h` | Relative L2 loss normalized by the network prediction [[Lehtinen et al. 2018]](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Fnoise2noise).\n| Relative L2 Luminance | `include\u002Ftiny-cuda-nn\u002Flosses\u002Frelative_l2_luminance.h` | Same as above, but normalized by the luminance of the network prediction. Only applicable when the network prediction is RGB. Used in Neural Radiance Caching [[Müller et al. 2021]](https:\u002F\u002Ftom94.net\u002F).\n| Cross Entropy | `include\u002Ftiny-cuda-nn\u002Flosses\u002Fcross_entropy.h` | Standard cross entropy loss. Only applicable when the network prediction is a PDF.\n| Variance | `include\u002Ftiny-cuda-nn\u002Flosses\u002Fvariance_is.h` | Standard variance loss. Only applicable when the network prediction is a PDF.\n\n| Optimizers | &nbsp; | &nbsp;\n| :--- | :---------- | :-----\n| Adam | `include\u002Ftiny-cuda-nn\u002Foptimizers\u002Fadam.h` | Implementation of Adam [[Kingma and Ba 2014]](https:\u002F\u002Farxiv.org\u002Fabs\u002F1412.6980), generalized to AdaBound [[Luo et al. 
2019]](https:\u002F\u002Fgithub.com\u002FLuolc\u002FAdaBound).\n| Novograd | `include\u002Ftiny-cuda-nn\u002Foptimizers\u002Fnovograd.h` | Implementation of Novograd [[Ginsburg et al. 2019]](https:\u002F\u002Farxiv.org\u002Fabs\u002F1905.11286).\n| SGD | `include\u002Ftiny-cuda-nn\u002Foptimizers\u002Fsgd.h` | Standard stochastic gradient descent (SGD).\n| Shampoo | `include\u002Ftiny-cuda-nn\u002Foptimizers\u002Fshampoo.h` | Implementation of the 2nd order Shampoo optimizer [[Gupta et al. 2018]](https:\u002F\u002Farxiv.org\u002Fabs\u002F1802.09568) with home-grown optimizations as well as those by [Anil et al. [2020]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2002.09018).\n| Average | `include\u002Ftiny-cuda-nn\u002Foptimizers\u002Faverage.h` | Wraps another optimizer and computes a linear average of the weights over the last N iterations. The average is used for inference only (does not feed back into training).\n| Batched | `include\u002Ftiny-cuda-nn\u002Foptimizers\u002Fbatched.h` | Wraps another optimizer, invoking the nested optimizer once every N steps on the averaged gradient. Has the same effect as increasing the batch size but requires only a constant amount of memory.\n| Composite | `include\u002Ftiny-cuda-nn\u002Foptimizers\u002Fcomposite.h` | Allows using several optimizers on different parameters.\n| EMA | `include\u002Ftiny-cuda-nn\u002Foptimizers\u002Fema.h` | Wraps another optimizer and computes an exponential moving average of the weights. The average is used for inference only (does not feed back into training).\n| Exponential Decay | `include\u002Ftiny-cuda-nn\u002Foptimizers\u002Fexponential_decay.h` | Wraps another optimizer and performs piecewise-constant exponential learning-rate decay.\n| Lookahead | `include\u002Ftiny-cuda-nn\u002Foptimizers\u002Flookahead.h` | Wraps another optimizer, implementing the lookahead algorithm [[Zhang et al. 
2019]](https:\u002F\u002Farxiv.org\u002Fabs\u002F1907.08610).\n\n\n## License and Citation\n\nThis framework is licensed under the BSD 3-clause license. Please see `LICENSE.txt` for details.\n\nIf you use it in your research, we would appreciate a citation via\n```bibtex\n@software{tiny-cuda-nn,\n\tauthor = {M\\\"uller, Thomas},\n\tlicense = {BSD-3-Clause},\n\tmonth = {4},\n\ttitle = {{tiny-cuda-nn}},\n\turl = {https:\u002F\u002Fgithub.com\u002FNVlabs\u002Ftiny-cuda-nn},\n\tversion = {2.0},\n\tyear = {2021}\n}\n```\n\nFor business inquiries, please visit our website and submit the form: [NVIDIA Research Licensing](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fresearch\u002Finquiries\u002F)\n\n\n## Publications & Software\n\nAmong others, this framework powers the following publications:\n\n> __Instant Neural Graphics Primitives with a Multiresolution Hash Encoding__  \n> [Thomas Müller](https:\u002F\u002Ftom94.net), [Alex Evans](https:\u002F\u002Fresearch.nvidia.com\u002Fperson\u002Falex-evans), [Christoph Schied](https:\u002F\u002Fresearch.nvidia.com\u002Fperson\u002Fchristoph-schied), [Alexander Keller](https:\u002F\u002Fresearch.nvidia.com\u002Fperson\u002Falex-keller)  \n> _ACM Transactions on Graphics (__SIGGRAPH__), July 2022_  \n> __[Website](https:\u002F\u002Fnvlabs.github.io\u002Finstant-ngp\u002F)&nbsp;\u002F [Paper](https:\u002F\u002Fnvlabs.github.io\u002Finstant-ngp\u002Fassets\u002Fmueller2022instant.pdf)&nbsp;\u002F [Code](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Finstant-ngp)&nbsp;\u002F [Video](https:\u002F\u002Fnvlabs.github.io\u002Finstant-ngp\u002Fassets\u002Fmueller2022instant.mp4)&nbsp;\u002F [BibTeX](https:\u002F\u002Fnvlabs.github.io\u002Finstant-ngp\u002Fassets\u002Fmueller2022instant.bib)__\n\n> __Extracting Triangular 3D Models, Materials, and Lighting From Images__  \n> [Jacob Munkberg](https:\u002F\u002Fresearch.nvidia.com\u002Fperson\u002Fjacob-munkberg), [Jon 
Hasselgren](https:\u002F\u002Fresearch.nvidia.com\u002Fperson\u002Fjon-hasselgren), [Tianchang Shen](http:\u002F\u002Fwww.cs.toronto.edu\u002F~shenti11\u002F), [Jun Gao](http:\u002F\u002Fwww.cs.toronto.edu\u002F~jungao\u002F), [Wenzheng Chen](http:\u002F\u002Fwww.cs.toronto.edu\u002F~wenzheng\u002F), [Alex Evans](https:\u002F\u002Fresearch.nvidia.com\u002Fperson\u002Falex-evans), [Thomas Müller](https:\u002F\u002Ftom94.net), [Sanja Fidler](https:\u002F\u002Fwww.cs.toronto.edu\u002F~fidler\u002F)  \n> __CVPR (Oral)__, June 2022  \n> __[Website](https:\u002F\u002Fnvlabs.github.io\u002Fnvdiffrec\u002F)&nbsp;\u002F [Paper](https:\u002F\u002Fnvlabs.github.io\u002Fnvdiffrec\u002Fassets\u002Fpaper.pdf)&nbsp;\u002F [Video](https:\u002F\u002Fnvlabs.github.io\u002Fnvdiffrec\u002Fassets\u002Fvideo.mp4)&nbsp;\u002F [BibTeX](https:\u002F\u002Fnvlabs.github.io\u002Fnvdiffrec\u002Fassets\u002Fbib.txt)__\n\n> __Real-time Neural Radiance Caching for Path Tracing__  \n> [Thomas Müller](https:\u002F\u002Ftom94.net), [Fabrice Rousselle](https:\u002F\u002Fresearch.nvidia.com\u002Fperson\u002Ffabrice-rousselle), [Jan Novák](http:\u002F\u002Fjannovak.info), [Alexander Keller](https:\u002F\u002Fresearch.nvidia.com\u002Fperson\u002Falex-keller)  \n> _ACM Transactions on Graphics (__SIGGRAPH__), August 2021_  \n> __[Paper](https:\u002F\u002Ftom94.net\u002Fdata\u002Fpublications\u002Fmueller21realtime\u002Fmueller21realtime.pdf)&nbsp;\u002F [GTC talk](https:\u002F\u002Fgtc21.event.nvidia.com\u002Fmedia\u002FFully%20Fused%20Neural%20Network%20for%20Radiance%20Caching%20in%20Real%20Time%20Rendering%20%5BE31307%5D\u002F1_liqy6k1c)&nbsp;\u002F [Video](https:\u002F\u002Ftom94.net\u002Fdata\u002Fpublications\u002Fmueller21realtime\u002Fmueller21realtime.mp4)&nbsp;\u002F [Interactive results viewer](https:\u002F\u002Ftom94.net\u002Fdata\u002Fpublications\u002Fmueller21realtime\u002Finteractive-viewer\u002F)&nbsp;\u002F 
[BibTeX](https:\u002F\u002Ftom94.net\u002Fdata\u002Fpublications\u002Fmueller21realtime\u002Fmueller21realtime.bib)__\n\n\nAs well as the following software:\n\n> __NerfAcc: A General NeRF Acceleration Toolbox__  \n> [Ruilong Li](https:\u002F\u002Fwww.liruilong.cn\u002F), [Matthew Tancik](https:\u002F\u002Fwww.matthewtancik.com\u002Fabout-me), [Angjoo Kanazawa](https:\u002F\u002Fpeople.eecs.berkeley.edu\u002F~kanazawa\u002F)  \n> __https:\u002F\u002Fgithub.com\u002FKAIR-BAIR\u002Fnerfacc__\n\n> __Nerfstudio: A Framework for Neural Radiance Field Development__  \n> [Matthew Tancik*](https:\u002F\u002Fwww.matthewtancik.com\u002Fabout-me), [Ethan Weber*](https:\u002F\u002Fethanweber.me\u002F), [Evonne Ng*](http:\u002F\u002Fpeople.eecs.berkeley.edu\u002F~evonne_ng\u002F), [Ruilong Li](https:\u002F\u002Fwww.liruilong.cn\u002F), Brent Yi, Terrance Wang, Alexander Kristoffersen, Jake Austin, Kamyar Salahi, Abhik Ahuja, David McAllister, [Angjoo Kanazawa](https:\u002F\u002Fpeople.eecs.berkeley.edu\u002F~kanazawa\u002F)  \n> __https:\u002F\u002Fgithub.com\u002Fnerfstudio-project\u002Fnerfstudio__\n\nPlease feel free to make a pull request if your publication or software is not listed.\n\n## Acknowledgments\n\nSpecial thanks go to the NRC authors for helpful discussions and to [Nikolaus Binder](https:\u002F\u002Fresearch.nvidia.com\u002Fperson\u002Fnikolaus-binder) for providing part of the infrastructure of this framework, as well as for help with utilizing TensorCores from within CUDA.\n","# 微型 CUDA 神经网络 
![](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Ftiny-cuda-nn\u002Fworkflows\u002FCI\u002Fbadge.svg)\n\n这是一个小型、自包含的框架，用于训练和查询神经网络。最引人注目的是，它包含一个极快的“完全融合”多层感知机（[技术论文](https:\u002F\u002Ftom94.net\u002Fdata\u002Fpublications\u002Fmueller21realtime\u002Fmueller21realtime.pdf)），一种多功能的[多分辨率哈希编码](https:\u002F\u002Fraw.githubusercontent.com\u002FNVlabs\u002Ftiny-cuda-nn\u002Fmaster\u002Fdata\u002Freadme\u002Fmultiresolution-hash-encoding-diagram.png)（[技术论文](https:\u002F\u002Fnvlabs.github.io\u002Finstant-ngp\u002Fassets\u002Fmueller2022instant.pdf)），以及对各种其他输入编码、损失函数和优化器的支持。\n\n## 性能\n\n![Image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVlabs_tiny-cuda-nn_readme_feb6d888d04f.png)\n_完全融合网络与 TensorFlow v2.5.0 w\u002F XLA 的对比。在 RTX 3090 上，分别测量了宽度为 64（实线）和 128（虚线）个神经元的多层感知机。由 `benchmarks\u002Fbench_ours.cu` 和 `benchmarks\u002Fbench_tensorflow.py` 使用 `data\u002Fconfig_oneblob.json` 生成。_\n\n\n## 使用方法\n\n微型 CUDA 神经网络提供了一个简单的 C++\u002FCUDA API：\n\n```cpp\n#include \u003Ctiny-cuda-nn\u002Fcommon.h>\n\n\u002F\u002F 配置模型\nnlohmann::json config = {\n\t{\"loss\", {\n\t\t{\"otype\", \"L2\"}\n\t}},\n\t{\"optimizer\", {\n\t\t{\"otype\", \"Adam\"},\n\t\t{\"learning_rate\", 1e-3},\n\t}},\n\t{\"encoding\", {\n\t\t{\"otype\", \"HashGrid\"},\n\t\t{\"n_levels\", 16},\n\t\t{\"n_features_per_level\", 2},\n\t\t{\"log2_hashmap_size\", 19},\n\t\t{\"base_resolution\", 16},\n\t\t{\"per_level_scale\", 2.0},\n\t}},\n\t{\"network\", {\n\t\t{\"otype\", \"FullyFusedMLP\"},\n\t\t{\"activation\", \"ReLU\"},\n\t\t{\"output_activation\", \"None\"},\n\t\t{\"n_neurons\", 64},\n\t\t{\"n_hidden_layers\", 2},\n\t}},\n};\n\nusing namespace tcnn;\n\nauto model = create_from_config(n_input_dims, n_output_dims, config);\nmodel->set_jit_fusion(supports_jit_fusion()); \u002F\u002F 可选：通过 JIT 融合加速\n\n\u002F\u002F 训练模型（batch_size 必须是 tcnn::BATCH_SIZE_GRANULARITY 的倍数）\nGPUMatrix\u003Cfloat> training_batch_inputs(n_input_dims, batch_size);\nGPUMatrix\u003Cfloat> training_batch_targets(n_output_dims, 
batch_size);\n\nfor (int i = 0; i \u003C n_training_steps; ++i) {\n\tgenerate_training_batch(&training_batch_inputs, &training_batch_targets); \u002F\u002F \u003C-- 你的代码\n\n\tfloat loss;\n\tmodel.trainer->training_step(training_batch_inputs, training_batch_targets, &loss);\n\tstd::cout \u003C\u003C \"iteration=\" \u003C\u003C i \u003C\u003C \" loss=\" \u003C\u003C loss \u003C\u003C std::endl;\n}\n\n\u002F\u002F 使用模型\nGPUMatrix\u003Cfloat> inference_inputs(n_input_dims, batch_size);\ngenerate_inputs(&inference_inputs); \u002F\u002F \u003C-- 你的代码\n\nGPUMatrix\u003Cfloat> inference_outputs(n_output_dims, batch_size);\nmodel.network->inference(inference_inputs, inference_outputs);\n```\n\n## JIT 融合\n\nJIT 融合是 tiny-cuda-nn v2.0 及更高版本中的一项新功能，属于可选特性。\n根据模型和 GPU 的不同，*几乎总是*建议启用[自动 JIT 融合](#automatic-jit-fusion)，以获得 1.5 到 2.5 倍的性能提升。\n较新的 GPU 通常会带来更大的加速效果。\n\n如果您的模型包含非常大的哈希网格（约 2000 万+ 参数）或 MLP（每层神经元数量超过 128 个），或者您的 GPU 是 RTX 3000 系列及更早型号，则 JIT 融合*可能会*减慢训练速度。\n在极少数情况下，推理也会变慢。\n在这种情况下，建议分别尝试为训练和推理启用 JIT 融合，以衡量是否确实更快。\n\n如果您在其他情况下遇到性能下降，或在启用 JIT 融合时遇到其他问题，请[提交一个问题](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Ftiny-cuda-nn\u002Fissues)。\n\n### 自动 JIT 融合\n\n要启用 JIT 融合，只需将模型的 `jit_fusion` 属性设置为 `true`。\n此后，无论进行推理还是训练，模型都将使用 JIT 模式。\n请注意，如果 JIT 编译过程中出现错误，系统将发出警告，并自动关闭 JIT 编译模式。\n此时，您的代码仍将使用 tiny-cuda-nn 1.X 的代码路径运行。\n\n```cpp\nauto model = tcnn::create_from_config(...);\nmodel->set_jit_fusion(tcnn::supports_jit_fusion()); \u002F\u002F 如果系统支持，则启用 JIT\n```\n\nJIT 融合也可以通过 PyTorch 绑定来启用，但加速效果会较低，尤其是在训练阶段。\n这是因为 JIT 编译器无法访问完整的计算图，因此能够融合和优化的内容较少。\n\n```python\nimport tinycudann as tcnn\n\nmodel = tcnn.NetworkWithInputEncoding(...) 
# 或任何其他 tcnn 模型\nmodel.jit_fusion = tcnn.supports_jit_fusion() # 如果系统支持，则启用 JIT\n```\n\n### 手动 JIT 融合\n\n当应用程序与 JIT 融合更紧密地集成时，可以获得更大的加速效果。\n例如，[Instant NGP](https:\u002F\u002Fgithub.com\u002Fnvlabs\u002Finstant-ngp) 通过将整个 NeRF 射线追踪器融合到一个内核中，实现了 5 倍的加速。\n\nJIT 融合的工作原理是将给定的 tiny-cuda-nn 模型转换为 CUDA 设备函数，然后利用 CUDA 的运行时编译 (RTC) 功能将其编译成一个内核。\n\n要将 tiny-cuda-nn 模型与您应用中的更大内核集成，您需要：\n1. 将您的内核转换为字符串，\n2. 在其前缀添加 tiny-cuda-nn 模型的设备函数，\n3. 将结果传递给 tiny-cuda-nn 的运行时编译 API。\n\n以下是一个示例，展示了如何使用具有 32 个输入维度和 16 个输出维度的 tiny-cuda-nn 模型实现一个最小内核：\n```cpp\n#include \u003Ctiny-cuda-nn\u002Frtc_kernel.h>\n\nauto model = tcnn::create_from_config(32 \u002F* input dims *\u002F, 16 \u002F* output dims *\u002F, ...);\nauto fused_kernel = tcnn::CudaRtcKernel(\n    \"your_kernel\",\n    fmt::format(R\"\n        {MODEL_DEVICE_FUNCTION}\n        __global__ void your_kernel(...) {\n            \u002F\u002F 从寄存器或内存中获取模型输入。\n            tcnn::hvec\u003C32> input = ...;\n            \u002F\u002F 调用 tiny-cuda-nn 模型。在此处，warp 中的所有 32 个线程都必须处于活动状态。\n            tcnn::hvec\u003C16> output = model_fun(input, params); \n            \u002F\u002F 对模型输出做些处理。\n        }\",\n        fmt::arg(\"MODEL_DEVICE_FUNCTION\", model->generate_device_function(\"model_fun\")),\n    )\n);\n\nuint32_t blocks = 1;\nuint32_t threads = 128; \u002F\u002F 必须是 32 的倍数，以便神经网络正常工作。\nuint32_t shmem_size = 0; \u002F\u002F 可以根据 your_kernel 的需求设置任意大小。\ncudaStream_t stream = nullptr; \u002F\u002F 可以使用任意流。\nfused_kernel.launch(blocks, threads, shmem_size, stream, ... 
\u002F* your_kernel 的参数 *\u002F);\n```\n\n以下是 Instant NGP 的 NeRF 集成与 JIT 编译器的参考：\n- [src\u002Ftestbed_nerf.cu](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Finstant-ngp\u002Fblob\u002Fd6bbefb0b68e6322711b518eac7f9ab4c1cc7b1e\u002Fsrc\u002Ftestbed_nerf.cu#L1931)\n- [include\u002Fneural-graphics-primitives\u002Ffused_kernels\u002Frender_nerf.cuh](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Finstant-ngp\u002Fblob\u002Fmaster\u002Finclude\u002Fneural-graphics-primitives\u002Ffused_kernels\u002Frender_nerf.cuh)\n\n## 示例：学习一张2D图像\n\n我们提供了一个示例应用，用于学习一个图像函数 _(x,y) -> (R,G,B)_。可以通过以下命令运行：\n```sh\ntiny-cuda-nn$ .\u002Fbuild\u002Fmlp_learning_an_image https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVlabs_tiny-cuda-nn_readme_974957d202f6.jpg data\u002Fconfig_hash.json\n```\n该程序会在每隔几个训练步骤生成一张图像。在 RTX 4090 上使用默认配置时，每 1000 步大约需要 1 秒多一点。\n\n| 10 步 | 100 步 | 1000 步 | 参考图像 |\n|:---:|:---:|:---:|:---:|\n| ![10steps](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVlabs_tiny-cuda-nn_readme_f01b08571aad.jpg) | ![100steps](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVlabs_tiny-cuda-nn_readme_4147a14635a6.jpg) | ![1000steps](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVlabs_tiny-cuda-nn_readme_819f9f9d1bc2.jpg) | ![reference](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVlabs_tiny-cuda-nn_readme_974957d202f6.jpg) |\n\n\n\n## 要求\n\n- 一块 __NVIDIA GPU__；如果支持张量核心，则可以进一步提升性能。所有展示的结果均来自 RTX 3090。\n- 一个支持 __C++17__ 的编译器。推荐并经过测试的选项如下：\n  - __Windows：__ Visual Studio 2019 或 2022\n  - __Linux：__ GCC\u002FG++ 8 或更高版本\n- 一个较新的 __[CUDA](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-toolkit)__ 版本。推荐并经过测试的选项如下：\n  - __Windows：__ CUDA 11.5 或更高版本\n  - __Linux：__ CUDA 10.2 或更高版本\n- __[CMake](https:\u002F\u002Fcmake.org\u002F) v3.21 或更高版本__。\n- 本框架中的全融合 MLP 组件在其默认配置下需要 __非常大 的共享内存__。因此，它很可能仅能在 RTX 3090、RTX 2080 Ti 或更高端的 GPU 上运行。对于低端显卡，必须降低 `n_neurons` 参数，或者改用 `CutlassMLP`（兼容性更好但速度较慢）。\n\n如果你使用的是 Linux，请安装以下软件包：\n```sh\nsudo apt-get install build-essential 
git\n```\n\n我们还建议将 [CUDA](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-toolkit) 安装到 `\u002Fusr\u002Flocal\u002F` 目录，并将 CUDA 的安装路径添加到你的 PATH 环境变量中。例如，如果你安装了 CUDA 12.6.3，可以在 `~\u002F.bashrc` 文件中添加以下内容：\n```sh\nexport PATH=\"\u002Fusr\u002Flocal\u002Fcuda-12.6.3\u002Fbin:$PATH\"\nexport LD_LIBRARY_PATH=\"\u002Fusr\u002Flocal\u002Fcuda-12.6.3\u002Flib64:$LD_LIBRARY_PATH\"\n```\n\n\n## 编译（Windows 和 Linux）\n\n首先，使用以下命令克隆本仓库及其所有子模块：\n```sh\n$ git clone --recursive https:\u002F\u002Fgithub.com\u002Fnvlabs\u002Ftiny-cuda-nn\n$ cd tiny-cuda-nn\n```\n\n然后，使用 CMake 构建项目：（在 Windows 上，必须在 [开发者命令提示符](https:\u002F\u002Fdocs.microsoft.com\u002Fen-us\u002Fcpp\u002Fbuild\u002Fbuilding-on-the-command-line?view=msvc-160#developer_command_prompt) 中执行）\n```sh\ntiny-cuda-nn$ cmake . -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo\ntiny-cuda-nn$ cmake --build build --config RelWithDebInfo -j\n```\n\n如果编译无故失败或耗时超过一小时，可能是内存不足。此时可以尝试去掉 `-j` 参数重新编译。\n\n\n## PyTorch 扩展\n\n__tiny-cuda-nn__ 自带一个 [PyTorch](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch) 扩展，允许在 [Python](https:\u002F\u002Fwww.python.org\u002F) 环境中使用其高效的 MLP 和输入编码功能。这些绑定通常比纯 Python 实现快得多，尤其是在使用 [多分辨率哈希编码](https:\u002F\u002Fraw.githubusercontent.com\u002FNVlabs\u002Ftiny-cuda-nn\u002Fmaster\u002Fdata\u002Freadme\u002Fmultiresolution-hash-encoding-diagram.png) 时。\n\n> 不过，如果批量大小较小，Python\u002FPyTorch 的开销仍然会很大。\n> 例如，当批量大小为 64k 时，捆绑的 `mlp_learning_an_image` 示例通过 PyTorch 运行的速度比原生 CUDA 慢约 2 倍。\n> 而当批量大小达到 256k 或更高时（默认设置），两者的性能差距就会小得多。\n\n首先，设置一个支持 CUDA 的最新版 PyTorch 的 Python 3.X 环境。然后，运行以下命令安装扩展：\n```sh\npip install git+https:\u002F\u002Fgithub.com\u002FNVlabs\u002Ftiny-cuda-nn\u002F#subdirectory=bindings\u002Ftorch\n```\n\n或者，如果你希望从本地克隆的 __tiny-cuda-nn__ 安装，可以执行以下命令：\n```sh\ntiny-cuda-nn$ cd bindings\u002Ftorch\ntiny-cuda-nn\u002Fbindings\u002Ftorch$ python setup.py install\n```\n\n默认情况下，该扩展会在支持半精度计算的 GPU（Volta、Turing、Ampere 等）上自动启用 FP16，在旧架构或 FP16 性能较差的硬件（如 Pascal\u002FGTX 10 系列）上禁用 FP16。\n\n如果你想覆盖此行为（例如强制在不支持的硬件上启用 FP16，或为了调试而禁用 
FP16），可以在安装前设置 TCNN_HALF_PRECISION 环境变量：\n\n禁用 FP16：0  \n启用 FP16：1  \n\n示例：\n```sh\n# Linux \u002F macOS（禁用 FP16）\nexport TCNN_HALF_PRECISION=0\npip install git+https:\u002F\u002Fgithub.com\u002FNVlabs\u002Ftiny-cuda-nn\u002F#subdirectory=bindings\u002Ftorch\n```\n\n安装成功后，你可以按照以下示例使用 __tiny-cuda-nn__ 模型：\n```py\nimport commentjson as json\nimport tinycudann as tcnn\nimport torch\n\nwith open(\"data\u002Fconfig_hash.json\") as f:\n\tconfig = json.load(f)\n\n# 选项 1：高效的编码+网络组合。\nmodel = tcnn.NetworkWithInputEncoding(\n\tn_input_dims, n_output_dims,\n\tconfig[\"encoding\"], config[\"network\"]\n)\n\n# 选项 2：分离模块。速度稍慢但更灵活。\nencoding = tcnn.Encoding(n_input_dims, config[\"encoding\"])\nnetwork = tcnn.Network(encoding.n_output_dims, n_output_dims, config[\"network\"])\nmodel = torch.nn.Sequential(encoding, network)\n\nmodel.jit_fusion = tcnn.supports_jit_fusion() # 可选：通过 JIT 融合加速\n```\n\n更多示例请参阅 `samples\u002Fmlp_learning_an_image_pytorch.py`。\n\n## 组件\n\n以下是该框架的组件概览。[JSON 文档](DOCUMENTATION.md)列出了配置选项。\n\n\n| 网络 | &nbsp; | &nbsp;\n| :--- | :---------- | :-----\n| 全融合 MLP | `src\u002Ffully_fused_mlp.cu` | 超快速的小型多层感知机（MLP）实现。\n| CUTLASS MLP     | `src\u002Fcutlass_mlp.cu`     | 基于 [CUTLASS](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass) GEMM 例程的 MLP。速度比全融合版本慢，但可以处理更大的网络，且仍然相当快。\n\n| 输入编码 | &nbsp; | &nbsp;\n| :--- | :---------- | :-----\n| 复合编码 | `include\u002Ftiny-cuda-nn\u002Fencodings\u002Fcomposite.h` | 允许组合多种编码。例如，可用于构建神经辐射缓存编码 [[Müller 等人, 2021]](https:\u002F\u002Ftom94.net\u002F)。\n| 频率编码 | `include\u002Ftiny-cuda-nn\u002Fencodings\u002Ffrequency.h` | NeRF 的 [[Mildenhall 等人, 2020]](https:\u002F\u002Fwww.matthewtancik.com\u002Fnerf) 位置编码，对所有维度均匀应用。\n| 格点编码 | `include\u002Ftiny-cuda-nn\u002Fencodings\u002Fgrid.h` | 基于可训练的多分辨率格点的编码。用于 [即时神经图形基元 [[Müller 等人, 2022]](https:\u002F\u002Fnvlabs.github.io\u002Finstant-ngp\u002F)。这些格点可以由哈希表、密集存储或分块存储支持。\n| 恒等编码 | `include\u002Ftiny-cuda-nn\u002Fencodings\u002Fidentity.h` | 不对输入值进行任何变换。\n| Oneblob 编码 | 
`include\u002Ftiny-cuda-nn\u002Fencodings\u002Foneblob.h` | 来自神经重要性采样 [[Müller 等人, 2019]](https:\u002F\u002Ftom94.net\u002Fdata\u002Fpublications\u002Fmueller18neural\u002Fmueller18neural-v4.pdf) 和神经控制变量 [[Müller 等人, 2020]](https:\u002F\u002Ftom94.net\u002Fdata\u002Fpublications\u002Fmueller20neural\u002Fmueller20neural.pdf)。\n| 球谐函数编码 | `include\u002Ftiny-cuda-nn\u002Fencodings\u002Fspherical_harmonics.h` | 一种频率空间编码，比逐分量编码更适合方向向量。\n| 三角波编码 | `include\u002Ftiny-cuda-nn\u002Fencodings\u002Ftriangle_wave.h` | NeRF 编码的低成本替代方案。用于神经辐射缓存 [[Müller 等人, 2021]](https:\u002F\u002Ftom94.net\u002F)。\n\n| 损失函数 | &nbsp; | &nbsp;\n| :--- | :---------- | :-----\n| L1 损失 | `include\u002Ftiny-cuda-nn\u002Flosses\u002Fl1.h` | 标准 L1 损失。\n| 相对 L1 损失 | `include\u002Ftiny-cuda-nn\u002Flosses\u002Frelative_l1.h` | 以网络预测值归一化的相对 L1 损失。\n| MAPE 损失 | `include\u002Ftiny-cuda-nn\u002Flosses\u002Fmape.h` | 平均绝对百分比误差（MAPE）。与相对 L1 损失相同，但以目标值归一化。\n| SMAPE 损失 | `include\u002Ftiny-cuda-nn\u002Flosses\u002Fsmape.h` | 对称平均绝对百分比误差（SMAPE）。与相对 L1 损失相同，但以预测值和目标值的均值归一化。\n| L2 损失 | `include\u002Ftiny-cuda-nn\u002Flosses\u002Fl2.h` | 标准 L2 损失。\n| 相对 L2 损失 | `include\u002Ftiny-cuda-nn\u002Flosses\u002Frelative_l2.h` | 以网络预测值归一化的相对 L2 损失 [[Lehtinen 等人, 2018]](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Fnoise2noise)。\n| 相对 L2 亮度损失 | `include\u002Ftiny-cuda-nn\u002Flosses\u002Frelative_l2_luminance.h` | 与上一条相同，但以网络预测的亮度归一化。仅适用于网络输出为 RGB 的情况。用于神经辐射缓存 [[Müller 等人, 2021]](https:\u002F\u002Ftom94.net\u002F)。\n| 交叉熵损失 | `include\u002Ftiny-cuda-nn\u002Flosses\u002Fcross_entropy.h` | 标准交叉熵损失。仅适用于网络输出为概率密度函数（PDF）的情况。\n| 方差损失 | `include\u002Ftiny-cuda-nn\u002Flosses\u002Fvariance_is.h` | 标准方差损失。仅适用于网络输出为 PDF 的情况。\n\n| 优化器 | &nbsp; | &nbsp;\n| :--- | :---------- | :-----\n| Adam | `include\u002Ftiny-cuda-nn\u002Foptimizers\u002Fadam.h` | 实现了 Adam [[Kingma 和 Ba, 2014]](https:\u002F\u002Farxiv.org\u002Fabs\u002F1412.6980)，并扩展为 AdaBound [[Luo 等人, 2019]](https:\u002F\u002Fgithub.com\u002FLuolc\u002FAdaBound)。\n| Novograd | 
`include\u002Ftiny-cuda-nn\u002Foptimizers\u002Fnovograd.h` | 实现了 Novograd [[Ginsburg 等人, 2019]](https:\u002F\u002Farxiv.org\u002Fabs\u002F1905.11286)。\n| SGD | `include\u002Ftiny-cuda-nn\u002Foptimizers\u002Fsgd.h` | 标准随机梯度下降（SGD）。\n| Shampoo | `include\u002Ftiny-cuda-nn\u002Foptimizers\u002Fshampoo.h` | 实现了二阶 Shampoo 优化器 [[Gupta 等人, 2018]](https:\u002F\u002Farxiv.org\u002Fabs\u002F1802.09568)，结合了自主研发的优化技术以及 [Anil 等人, 2020] 的改进。\n| 平均优化器 | `include\u002Ftiny-cuda-nn\u002Foptimizers\u002Faverage.h` | 包装另一个优化器，在最近 N 次迭代中计算权重的线性平均值。该平均值仅用于推理（不反馈回训练过程）。\n| 批量优化器 | `include\u002Ftiny-cuda-nn\u002Foptimizers\u002Fbatched.h` | 包装另一个优化器，在每次 N 步时对平均梯度调用嵌套优化器一次。效果相当于增大批次大小，但只需恒定的内存开销。\n| 复合优化器 | `include\u002Ftiny-cuda-nn\u002Foptimizers\u002Fcomposite.h` | 允许对不同参数使用多个优化器。\n| EMA 优化器 | `include\u002Ftiny-cuda-nn\u002Foptimizers\u002Fema.h` | 包装另一个优化器，计算权重的指数移动平均值。该平均值仅用于推理（不反馈回训练过程）。\n| 指数衰减优化器 | `include\u002Ftiny-cuda-nn\u002Foptimizers\u002Fexponential_decay.h` | 包装另一个优化器，执行分段常数的指数学习率衰减。\n| Lookahead 优化器 | `include\u002Ftiny-cuda-nn\u002Foptimizers\u002Flookahead.h` | 包装另一个优化器，实现了 lookahead 算法 [[Zhang 等人, 2019]](https:\u002F\u002Farxiv.org\u002Fabs\u002F1907.08610)。\n\n\n## 许可与引用\n\n本框架采用 BSD 3 条款许可证授权。详情请参阅 `LICENSE.txt` 文件。\n\n如果您在研究中使用本框架，请通过以下 BibTeX 格式引用：\n```bibtex\n@software{tiny-cuda-nn,\n\tauthor = {M\\\"uller, Thomas},\n\tlicense = {BSD-3-Clause},\n\tmonth = {4},\n\ttitle = {{tiny-cuda-nn}},\n\turl = {https:\u002F\u002Fgithub.com\u002FNVlabs\u002Ftiny-cuda-nn},\n\tversion = {2.0},\n\tyear = {2021}\n}\n```\n\n如需商业合作，请访问我们的官网并提交表格：[NVIDIA Research Licensing](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fresearch\u002Finquiries\u002F)\n\n## 出版物与软件\n\n该框架支持以下出版物：\n\n> __具有多分辨率哈希编码的即时神经图形基元__  \n> 
[托马斯·穆勒](https:\u002F\u002Ftom94.net)、[亚历克斯·埃文斯](https:\u002F\u002Fresearch.nvidia.com\u002Fperson\u002Falex-evans)、[克里斯托夫·希德](https:\u002F\u002Fresearch.nvidia.com\u002Fperson\u002Fchristoph-schied)、[亚历山大·凯勒](https:\u002F\u002Fresearch.nvidia.com\u002Fperson\u002Falex-keller)  \n> _ACM 图形学汇刊 (__SIGGRAPH__)，2022年7月_  \n> __[网站](https:\u002F\u002Fnvlabs.github.io\u002Finstant-ngp\u002F)&nbsp;\u002F [论文](https:\u002F\u002Fnvlabs.github.io\u002Finstant-ngp\u002Fassets\u002Fmueller2022instant.pdf)&nbsp;\u002F [代码](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Finstant-ngp)&nbsp;\u002F [视频](https:\u002F\u002Fnvlabs.github.io\u002Finstant-ngp\u002Fassets\u002Fmueller2022instant.mp4)&nbsp;\u002F [BibTeX](https:\u002F\u002Fnvlabs.github.io\u002Finstant-ngp\u002Fassets\u002Fmueller2022instant.bib)__\n\n> __从图像中提取三角形3D模型、材质和光照__  \n> [雅各布·蒙克贝格](https:\u002F\u002Fresearch.nvidia.com\u002Fperson\u002Fjacob-munkberg)、[乔恩·哈塞尔格伦](https:\u002F\u002Fresearch.nvidia.com\u002Fperson\u002Fjon-hasselgren)、[沈天畅](http:\u002F\u002Fwww.cs.toronto.edu\u002F~shenti11\u002F)、[高俊](http:\u002F\u002Fwww.cs.toronto.edu\u002F~jungao\u002F)、[陈文政](http:\u002F\u002Fwww.cs.toronto.edu\u002F~wenzheng\u002F)、[亚历克斯·埃文斯](https:\u002F\u002Fresearch.nvidia.com\u002Fperson\u002Falex-evans)、[托马斯·穆勒](https:\u002F\u002Ftom94.net)、[桑雅·菲德勒](https:\u002F\u002Fwww.cs.toronto.edu\u002F~fidler\u002F)  \n> __CVPR（口头报告）__，2022年6月  \n> __[网站](https:\u002F\u002Fnvlabs.github.io\u002Fnvdiffrec\u002F)&nbsp;\u002F [论文](https:\u002F\u002Fnvlabs.github.io\u002Fnvdiffrec\u002Fassets\u002Fpaper.pdf)&nbsp;\u002F [视频](https:\u002F\u002Fnvlabs.github.io\u002Fnvdiffrec\u002Fassets\u002Fvideo.mp4)&nbsp;\u002F [BibTeX](https:\u002F\u002Fnvlabs.github.io\u002Fnvdiffrec\u002Fassets\u002Fbib.txt)__\n\n> __用于路径追踪的实时神经辐射缓存__  \n> 
[托马斯·穆勒](https:\u002F\u002Ftom94.net)、[法布里斯·鲁塞尔](https:\u002F\u002Fresearch.nvidia.com\u002Fperson\u002Ffabrice-rousselle)、[扬·诺瓦克](http:\u002F\u002Fjannovak.info)、[亚历山大·凯勒](https:\u002F\u002Fresearch.nvidia.com\u002Fperson\u002Falex-keller)  \n> _ACM 图形学汇刊 (__SIGGRAPH__)，2021年8月_  \n> __[论文](https:\u002F\u002Ftom94.net\u002Fdata\u002Fpublications\u002Fmueller21realtime\u002Fmueller21realtime.pdf)&nbsp;\u002F [GTC演讲](https:\u002F\u002Fgtc21.event.nvidia.com\u002Fmedia\u002FFully%20Fused%20Neural%20Network%20for%20Radiance%20Caching%20in%20Real%20Time%20Rendering%20%5BE31307%5D\u002F1_liqy6k1c)&nbsp;\u002F [视频](https:\u002F\u002Ftom94.net\u002Fdata\u002Fpublications\u002Fmueller21realtime\u002Fmueller21realtime.mp4)&nbsp;\u002F [交互式结果查看器](https:\u002F\u002Ftom94.net\u002Fdata\u002Fpublications\u002Fmueller21realtime\u002Finteractive-viewer\u002F)&nbsp;\u002F [BibTeX](https:\u002F\u002Ftom94.net\u002Fdata\u002Fpublications\u002Fmueller21realtime\u002Fmueller21realtime.bib)__\n\n\n此外，该框架还支持以下软件：\n\n> __NerfAcc：通用NeRF加速工具箱__  \n> [李瑞龙](https:\u002F\u002Fwww.liruilong.cn\u002F)、[马修·坦西克](https:\u002F\u002Fwww.matthewtancik.com\u002Fabout-me)、[安久·卡纳扎瓦](https:\u002F\u002Fpeople.eecs.berkeley.edu\u002F~kanazawa\u002F)  \n> __https:\u002F\u002Fgithub.com\u002FKAIR-BAIR\u002Fnerfacc__\n\n> __Nerfstudio：神经辐射场开发框架__  \n> [马修·坦西克*](https:\u002F\u002Fwww.matthewtancik.com\u002Fabout-me)、[伊森·韦伯*](https:\u002F\u002Fethanweber.me\u002F)、[艾沃妮·吴*](http:\u002F\u002Fpeople.eecs.berkeley.edu\u002F~evonne_ng\u002F)、[李瑞龙](https:\u002F\u002Fwww.liruilong.cn\u002F)、布伦特·易、泰伦斯·王、亚历山大·克里斯托弗森、杰克·奥斯汀、卡米亚尔·萨拉希、阿比克·阿胡贾、大卫·麦卡利斯特、[安久·卡纳扎瓦](https:\u002F\u002Fpeople.eecs.berkeley.edu\u002F~kanazawa\u002F)  \n> __https:\u002F\u002Fgithub.com\u002Fnerfstudio-project\u002Fnerfstudio__\n\n如果您发现自己的出版物或软件未在此列出，欢迎随时提交Pull Request。\n\n## 致谢\n\n特别感谢NRC的作者们提供的有益讨论，以及[尼古劳斯·宾德](https:\u002F\u002Fresearch.nvidia.com\u002Fperson\u002Fnikolaus-binder)为本框架提供部分基础设施，并协助在CUDA中利用TensorCore。","# Tiny CUDA Neural Networks 
快速上手指南\n\nTiny CUDA Neural Networks (tcnn) 是一个轻量级、自包含的框架，专为在 NVIDIA GPU 上高效训练和推理神经网络而设计。其核心优势在于提供了极速的“全融合”多层感知机（Fully Fused MLP）和多分辨率哈希编码（Multiresolution Hash Encoding），广泛应用于即时神经图形基元（如 Instant NGP）等高性能场景。\n\n## 环境准备\n\n在开始之前，请确保您的系统满足以下硬件和软件要求：\n\n### 硬件要求\n*   **GPU**: 必须配备 **NVIDIA GPU**。若具备 Tensor Cores（如 RTX 20\u002F30\u002F40 系列或 A100\u002FH100），性能将显著提升。\n    *   *注意*：默认的“全融合”MLP 需要大量共享内存，建议在 RTX 3090、RTX 2080 Ti 或更高阶显卡上运行。低端显卡需减少 `n_neurons` 参数或改用 `CutlassMLP`。\n\n### 软件依赖\n*   **操作系统**: Windows 或 Linux\n*   **编译器**: 支持 C++17\n    *   Windows: Visual Studio 2019 或 2022\n    *   Linux: GCC\u002FG++ 8 或更高版本\n*   **CUDA Toolkit**:\n    *   Windows: CUDA 11.5 或更高\n    *   Linux: CUDA 10.2 或更高\n*   **构建工具**: CMake v3.21 或更高\n*   **其他 (Linux)**: `build-essential`, `git`\n\n**Linux 用户前置安装命令：**\n```bash\nsudo apt-get install build-essential git\n```\n\n**配置 CUDA 环境变量 (Linux 示例):**\n建议将 CUDA 安装在 `\u002Fusr\u002Flocal\u002F` 并添加到路径。以 CUDA 12.6.3 为例，请在 `~\u002F.bashrc` 中添加：\n```bash\nexport PATH=\"\u002Fusr\u002Flocal\u002Fcuda-12.6.3\u002Fbin:$PATH\"\nexport LD_LIBRARY_PATH=\"\u002Fusr\u002Flocal\u002Fcuda-12.6.3\u002Flib64:$LD_LIBRARY_PATH\"\n```\n\n## 安装步骤\n\n### 方式一：编译 C++ 原生库\n\n1.  **克隆仓库**\n    使用 `--recursive` 参数拉取代码及子模块：\n    ```bash\n    git clone --recursive https:\u002F\u002Fgithub.com\u002Fnvlabs\u002Ftiny-cuda-nn\n    cd tiny-cuda-nn\n    ```\n\n2.  **构建项目**\n    使用 CMake 进行配置和编译（Windows 用户请在 Developer Command Prompt 中执行）：\n    ```bash\n    cmake . 
-B build -DCMAKE_BUILD_TYPE=RelWithDebInfo\n    cmake --build build --config RelWithDebInfo -j\n    ```\n    *提示：如果编译过程中内存不足导致失败或耗时过长，请移除 `-j` 参数尝试单线程编译。*\n\n### 方式二：安装 PyTorch 扩展 (Python)\n\n如果您希望在 Python 环境中使用，请先确保已安装支持 CUDA 的 PyTorch。\n\n**在线安装：**\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002FNVlabs\u002Ftiny-cuda-nn\u002F#subdirectory=bindings\u002Ftorch\n```\n\n**本地源码安装：**\n```bash\ncd bindings\u002Ftorch\npython setup.py install\n```\n\n**关于半精度 (FP16) 的控制：**\n默认情况下，库会自动检测显卡架构启用或禁用 FP16。如需强制控制，可在安装前设置环境变量：\n*   禁用 FP16: `export TCNN_HALF_PRECISION=0`\n*   启用 FP16: `export TCNN_HALF_PRECISION=1`\n\n## 基本使用\n\n### 1. Python (PyTorch) 使用示例\n\n这是最常用的方式，适合快速原型开发和集成到现有 PyTorch 项目中。\n\n```python\nimport commentjson as json\nimport tinycudann as tcnn\nimport torch\n\n# 加载配置文件 (假设 config_hash.json 存在于当前目录)\nwith open(\"data\u002Fconfig_hash.json\") as f:\n    config = json.load(f)\n\n# 定义输入输出维度\nn_input_dims = 2\nn_output_dims = 3\n\n# 创建模型：组合编码器和网络\nmodel = tcnn.NetworkWithInputEncoding(\n    n_input_dims=n_input_dims,\n    n_output_dims=n_output_dims,\n    encoding_config=config[\"encoding\"],\n    network_config=config[\"network\"]\n)\n\n# 可选：启用 JIT 融合以获得 1.5x - 2.5x 的性能提升\nmodel.jit_fusion = tcnn.supports_jit_fusion()\n\n# 准备数据 (需在 GPU 上)\ninputs = torch.rand(1024, n_input_dims, device=\"cuda\", dtype=torch.float32)\ntargets = torch.rand(1024, n_output_dims, device=\"cuda\", dtype=torch.float32)\n\n# 前向传播\noutputs = model(inputs)\n\n# 计算损失并反向传播\nloss = ((outputs - targets) ** 2).mean()\nloss.backward()\n\n# 优化器步骤 (需配合 torch.optim 使用)\noptimizer = torch.optim.Adam(model.parameters(), lr=1e-3)\noptimizer.step()\n```\n\n### 2. C++ 原生 API 使用示例\n\n适合对性能有极致追求或需要深度定制 CUDA Kernel 的场景。\n\n```cpp\n#include \u003Ctiny-cuda-nn\u002Fcommon.h>\n#include \u003Ctiny-cuda-nn\u002Ftrainer.h>\n#include \u003Cnlohmann\u002Fjson.hpp>\n\nusing namespace tcnn;\n\nint main() {\n    \u002F\u002F 1. 
配置模型 (JSON 格式)\n    nlohmann::json config = {\n        {\"loss\", {{\"otype\", \"L2\"}}},\n        {\"optimizer\", {{\"otype\", \"Adam\"}, {\"learning_rate\", 1e-3}}},\n        {\"encoding\", {\n            {\"otype\", \"HashGrid\"},\n            {\"n_levels\", 16},\n            {\"n_features_per_level\", 2},\n            {\"log2_hashmap_size\", 19},\n            {\"base_resolution\", 16},\n            {\"per_level_scale\", 2.0},\n        }},\n        {\"network\", {\n            {\"otype\", \"FullyFusedMLP\"},\n            {\"activation\", \"ReLU\"},\n            {\"output_activation\", \"None\"},\n            {\"n_neurons\", 64},\n            {\"n_hidden_layers\", 2},\n        }},\n    };\n\n    uint32_t n_input_dims = 2;\n    uint32_t n_output_dims = 3;\n\n    \u002F\u002F 2. 创建模型\n    auto model = create_from_config(n_input_dims, n_output_dims, config);\n    \n    \u002F\u002F 可选：启用 JIT 融合加速\n    model->set_jit_fusion(supports_jit_fusion());\n\n    \u002F\u002F 3. 准备训练数据 (GPUMatrix 需位于显存)\n    uint32_t batch_size = 1024; \n    \u002F\u002F 注意：batch_size 必须是 tcnn::BATCH_SIZE_GRANULARITY 的倍数\n    GPUMatrix\u003Cfloat> training_batch_inputs(n_input_dims, batch_size);\n    GPUMatrix\u003Cfloat> training_batch_targets(n_output_dims, batch_size);\n\n    \u002F\u002F 此处应填入生成\u002F加载数据的逻辑\n    \u002F\u002F generate_training_batch(&training_batch_inputs, &training_batch_targets); \n\n    \u002F\u002F 4. 训练循环\n    int n_training_steps = 1000;\n    for (int i = 0; i \u003C n_training_steps; ++i) {\n        float loss;\n        \u002F\u002F 执行一步训练\n        model->trainer->training_step(training_batch_inputs, training_batch_targets, &loss);\n        \n        if (i % 100 == 0) {\n            std::cout \u003C\u003C \"iteration=\" \u003C\u003C i \u003C\u003C \" loss=\" \u003C\u003C loss \u003C\u003C std::endl;\n        }\n    }\n\n    \u002F\u002F 5. 
推理\n    GPUMatrix\u003Cfloat> inference_outputs(n_output_dims, batch_size);\n    model->network->inference(training_batch_inputs, inference_outputs);\n\n    return 0;\n}\n```\n\n### 性能提示\n*   **JIT Fusion**: 在大多数情况下建议开启 `jit_fusion`，可获得约 1.5 到 2.5 倍的加速，且较新的 GPU 通常收益更大。若模型包含非常大的哈希网格（约 2000 万以上参数）或较宽的 MLP（每层超过 128 个神经元），或 GPU 为 RTX 3000 系列及更早型号，训练可能反而变慢，此时建议分别对训练和推理实测后再决定是否开启。\n*   **Batch Size**: PyTorch 绑定在小 Batch Size 下会有较大的 Python 开销（64k 时约比原生 CUDA 慢 2 倍），建议尽量使用较大的 Batch Size (如 256k 或更高) 以缩小差距。","某自动驾驶仿真团队需要在 RTX 3090 显卡上实时训练高保真神经辐射场（NeRF），以重建复杂城市街道的 3D 场景并支持动态光照渲染。\n\n### 没有 tiny-cuda-nn 时\n- **训练速度极慢**：使用传统 TensorFlow 框架训练多层感知机（MLP），单次迭代耗时数百毫秒，完成高质量场景重建需数天甚至数周。\n- **显存占用过高**：常规神经网络结构在处理高分辨率哈希编码时显存爆炸，导致无法在单卡上运行大尺度场景，被迫降低分辨率牺牲细节。\n- **推理延迟严重**：模型查询速度慢，无法满足仿真系统对实时渲染（>30 FPS）的严苛要求，画面出现明显卡顿。\n- **代码耦合度高**：需要自行编写复杂的 CUDA 核函数来优化矩阵运算，开发周期长且极易出错，难以快速验证新算法。\n\n### 使用 tiny-cuda-nn 后\n- **训练效率飞跃**：利用其“全融合”MLP 架构，将训练速度提升 10 倍以上，原本需一周的训练任务缩短至数小时内完成。\n- **显存利用极致**：内置的多分辨率哈希编码技术大幅压缩参数量，使得在单张消费级显卡上也能流畅训练亿级参数的大场景。\n- **实时推理达成**：结合 JIT 融合技术，推理延迟降低至微秒级，轻松实现 60 FPS 以上的实时高清场景漫游与光照变化模拟。\n- **开发聚焦核心**：通过简洁的 C++\u002FJSON 配置接口即可调用底层高度优化的算子，工程师无需关注底层 CUDA 细节，专注于场景逻辑创新。\n\ntiny-cuda-nn 通过将底层算力压榨到极致，让实时高保真 3D 场景重建从理论走向工程落地。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVlabs_tiny-cuda-nn_feb6d888.png","NVlabs","NVIDIA Research Projects","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FNVlabs_fc20d641.jpg","",null,"http:\u002F\u002Fresearch.nvidia.com","https:\u002F\u002Fgithub.com\u002FNVlabs",[80,84,88,92,96,100],{"name":81,"color":82,"percentage":83},"C++","#f34b7d",72.7,{"name":85,"color":86,"percentage":87},"Cuda","#3A4E3A",18.4,{"name":89,"color":90,"percentage":91},"Python","#3572A5",6.2,{"name":93,"color":94,"percentage":95},"CMake","#DA3434",2.5,{"name":97,"color":98,"percentage":99},"Nix","#7e7eff",0.1,{"name":101,"color":102,"percentage":103},"Shell","#89e051",0,4456,561,"2026-04-09T19:00:02","NOASSERTION",4,"Linux, Windows","必需 NVIDIA GPU（支持 Tensor Core 更佳）。完全融合 MLP 默认配置需要大量共享内存，推荐 RTX 3090、RTX 2080 Ti 或更高端显卡；低端显卡需减少神经元数量或使用 CutlassMLP。CUDA 版本要求：Windows 需 
11.5+，Linux 需 10.2+。","未说明（但编译时若内存不足可能导致失败）",{"notes":113,"python":114,"dependencies":115},"1. 编译器要求：Windows 需 Visual Studio 2019\u002F2022，Linux 需 GCC\u002FG++ 8+。2. 完全融合 MLP 组件对显存和共享内存要求极高，低端卡需调整配置。3. 若使用 Linux，需安装 build-essential 和 git。4. JIT 融合功能在较新 GPU 上可提升 1.5-2.5 倍性能，但在大模型或旧显卡上可能变慢。5. PyTorch 绑定在小批量数据下开销较大。","3.X (用于 PyTorch 扩展)",[116,117,118,119,120,121],"CMake >= 3.21","CUDA Toolkit","PyTorch (CUDA enabled)","nlohmann\u002Fjson","fmt","commentjson",[14],[124,125,126,127,128,129,130,131,132],"neural-network","deep-learning","gpu","cuda","mlp","nerf","rendering","real-time","pytorch","2026-03-27T02:49:30.150509","2026-04-10T20:34:10.076031",[136,141,146,150,155,159,164],{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},28203,"如何使用特定版本的 tiny-cuda-nn 来避免安装问题？","如果最新版安装困难，可以尝试安装旧版本 v1.4。有用户确认 v1.4 版本在特定环境下可以成功安装。你可以通过查看项目的 Release 页面获取该版本的标签或源码进行安装。","https:\u002F\u002Fgithub.com\u002FNVlabs\u002Ftiny-cuda-nn\u002Fissues\u002F175",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},28204,"警告 'FullyFusedMLP is not supported for the selected architecture' 是什么意思？","这意味着你的 GPU 架构版本较低（如 GTX 1080 Ti 的架构 61），不支持全融合 MLP（FullyFusedMLP）优化，系统会自动回退到 CutlassMLP。虽然程序仍可运行，但性能不是最优。若要获得最大性能，建议使用架构版本为 75 或更高的 GPU（如 RTX 20 系列及以上）。","https:\u002F\u002Fgithub.com\u002FNVlabs\u002Ftiny-cuda-nn\u002Fissues\u002F47",{"id":147,"question_zh":148,"answer_zh":149,"source_url":140},28200,"编译 tiny-cuda-nn 时内存占用过高导致失败，需要多少内存？","编译过程确实非常消耗内存（用户报告需要约 130GB）。这通常是由于 CUDA 和 GCC 版本不兼容导致的低效编译。解决方案是升级环境：尝试使用更新的 Docker 镜像（如 `nvidia\u002Fcuda:11.8.0-cudnn8-devel-ubuntu22.04`），搭配 GCC 11.2.0 和 CUDA 11.8。相比之下，旧组合（如 CUDA 11.3 + GCC 9.4.0）极易引发此问题。如果无法升级硬件，请尝试更新 CUDA Toolkit 和编译器版本。",{"id":151,"question_zh":152,"answer_zh":153,"source_url":154},28201,"在 Windows 上安装时遇到 'Underlying buffer has been detached' 或 MSVC 编译错误怎么办？","这类错误通常与 CUDA 和 Visual Studio (MSVC) 版本不兼容有关，而非库本身的问题。建议步骤：\n1. 下载并安装官方最新版的 NVIDIA CUDA Toolkit（系统级安装）。\n2. 将 Visual Studio 和 Build Tools 更新到最新版本。\n3. 重启电脑。\n4. 
尝试在非 Conda 环境中通过 pip 安装。\n注意：有用户反馈在 Conda 环境中指定 `cudatoolkit=11.3` 能成功，而更高版本（11.4-11.7）会失败，因此可能需要降级 Conda 中的 cudatoolkit 至 11.3 以匹配系统环境。","https:\u002F\u002Fgithub.com\u002FNVlabs\u002Ftiny-cuda-nn\u002Fissues\u002F100",{"id":156,"question_zh":157,"answer_zh":158,"source_url":145},28202,"运行示例脚本时出现 'Got cutlass error: Error Internal' 错误如何解决？","该错误通常由 CUDA 版本过旧引起（例如用户遇到的 CUDA 11.0）。解决方案是将 CUDA 升级到较新版本（如 11.3 或 11.6+）。如果升级后在安装 `setup.py` 时遇到其他问题，可能需要修改 `setup.py` 文件以匹配最新的代码提交（参考 commit fb8f845），或者确保完全卸载旧版本后重新安装。",{"id":160,"question_zh":161,"answer_zh":162,"source_url":163},28205,"在 Linux 下编译失败报错 'nvcc fatal : A single input file is required...' 是什么原因？","这是一个构建配置错误，通常发生在 CMake 调用 nvcc 时参数传递不正确。这往往是因为构建脚本或 CMakeLists.txt 与当前的 CUDA 工具链版本不兼容。确保使用的是与项目兼容的 CUDA 版本，并尝试清理构建缓存（删除 build 目录）后重新运行 cmake 和 make。如果问题依旧，检查是否混用了不同版本的 nvcc 和 gcc。","https:\u002F\u002Fgithub.com\u002FNVlabs\u002Ftiny-cuda-nn\u002Fissues\u002F3",{"id":165,"question_zh":166,"answer_zh":167,"source_url":154},28206,"Conda 环境与系统级 CUDA 安装冲突导致安装失败怎么办？","Conda 自带的 `cudatoolkit` 有时仅包含运行时库，缺少编译所需的头文件或完整工具链，导致编译扩展失败。推荐做法是：\n1. 在系统层面安装完整的 NVIDIA CUDA Toolkit。\n2. 在 Conda 环境中安装 PyTorch 时，尽量让 PyTorch 使用系统级的 CUDA，或者确保 Conda 的 cudatoolkit 版本与系统级驱动\u002F编译器严格匹配（如锁定为 11.3）。\n3. 
如果可能，尝试跳出 Conda 环境，使用系统 Python 和 pip 进行安装测试，以排除环境隔离带来的路径问题。",[169,174,179,184,189,194,199],{"id":170,"version":171,"summary_zh":172,"released_at":173},189110,"v2.0","tiny-cuda-nn 现在提供即时编译（JIT）模式，可将编码、神经网络、损失函数，甚至反向传播融合为单个 CUDA 内核。这使得推理和训练速度*开箱即用*提升 1.5 至 2.5 倍，并且只需一行代码即可启用，请参阅下方的“自动 JIT”部分。\n\n当应用程序与 tiny-cuda-nn 的新 JIT 编译器进行深度集成时，还可获得更大的性能提升。例如，[Instant NGP](https:\u002F\u002Fgithub.com\u002Fnvlabs\u002Finstant-ngp) 通过将整个 NeRF 射线追踪器融合为一个内核，实现了 5 倍的加速。有关如何实现这一点的详细信息，请参阅“直接 JIT 集成”部分。\n\n## 自动 JIT\n\n要启用 JIT 编译模式，只需将模型的 `jit_fusion` 属性设置为 `true`。此后，该模型的所有使用场景（无论是推理还是训练）都将采用 JIT 模式。请注意，如果 JIT 编译过程中出现错误，系统会发出警告并自动关闭 JIT 编译模式。此时，您的代码仍将使用 tiny-cuda-nn 1.X 的代码路径运行。\n\n```cpp\nauto model = tcnn::create_from_config(...);\nmodel->set_jit_fusion(tcnn::supports_jit_fusion()); \u002F\u002F 如果系统支持，则启用 JIT\n```\n\n**注意：** 如果您的模型包含非常大的哈希网格（约 2000 万+ 参数）或 MLP（每层神经元数超过 128 个），或者您的 GPU 是 RTX 3000 系列及更早型号，JIT 融合*可能会*降低训练速度，偶尔也会影响推理速度。在这种情况下，建议分别针对训练和推理启用 JIT 融合，以评估其是否确实带来性能提升。\n\nJIT 融合也可以通过 PyTorch 绑定来启用，但相对于原生 C++ 接口，其加速效果会较低，尤其是在训练阶段。这是因为，在 PyTorch 中，JIT 编译器无法访问完整的计算图，因此能够融合和优化的内容较少。\n\n```python\nimport tinycudann as tcnn\n\nmodel = tcnn.NetworkWithInputEncoding(...) # 或任何其他 tcnn 模型\nmodel.jit_fusion = tcnn.supports_jit_fusion() # 如果系统支持，则启用 JIT\n```\n\n## 直接 JIT 集成\n\ntiny-cuda-nn 2.0 的 JIT 编译器的工作原理是：将给定的 tiny-cuda-nn 模型转换为 CUDA 设备函数，然后利用 CUDA 的运行时编译（RTC）功能将其编译为内核。\n\n要将 tiny-cuda-nn 模型与您应用程序中的更大内核集成，您需要：\n1. 将您的内核代码转换为字符串；\n2. 在字符串前添加 tiny-cuda-nn 模型的设备函数；\n3. 将结果传递给 tiny-cuda-nn 的运行时编译 API。\n\n以下是一个示例，展示了如何使用具有 32 个输入维度和 16 个输出维度的 tiny-cuda-nn 模型实现一个最小化内核：\n```cpp\n#include \u003Ctiny-cuda-nn\u002Frtc_kernel.h>\n\nauto model = tcnn::create_from_config(32 \u002F* 输入维度 *\u002F, 16 \u002F* 输出维度 *\u002F, ...);\nauto fused_kernel = tcnn::CudaRtcKernel(\n    \"your_kernel\",\n    fmt::format(R\"\n        {MODEL_DEVICE_FUNCTION}\n        __global__ void your_kernel(...) 
{\n            \u002F\u002F 从寄存器或内存中获取模型的输入。\n      ","2025-07-08T11:27:16",{"id":175,"version":176,"summary_zh":177,"released_at":178},189111,"v1.6","自四月以来，__tiny-cuda-nn__ 已经经历了诸多改进，并且其当前状态也已稳定运行了一段时间。因此，我认为现在是发布新版本的好时机。\n\n# 自上一版本以来的变更\n\n- **多 GPU 支持：** __tiny-cuda-nn__ 现在可以同时在多个 GPU 上运行。用户需要确保参数、输入、输出以及流都位于当前激活的 CUDA 设备上。\n  - PyTorch 的多 GPU 操作现可开箱即用。\n- **CMake 改进：** 当将 __tiny-cuda-nn__ 作为 CMake 子模块使用时，其头文件目录和库文件现在会被纳入 `PUBLIC` 接口进行管理。这意味着父项目只需以下两行 CMake 代码，即可在其 CUDA 代码中使用 __tiny-cuda-nn__：\n  ```cmake\n  add_subdirectory(dependencies\u002Ftiny-cuda-nn)\n  target_link_libraries(\u003Cparent project> PUBLIC tiny-cuda-nn)\n  ```\n- **各项功能升级：**\n  - `AdamOptimizer` 现在支持权重裁剪。\n  - 新增了 `CompositeOptimizer`（由 @Solonets 贡献）。它可以使用不同的优化器来优化模型的不同部分（例如编码器和神经网络），从而实现不同的学习率设置。\n  - `CompositeEncoding` 现在可以在其嵌套编码之上执行求和或乘积约简操作。\n  - `Encoding` 的输入和输出矩阵对齐过程已被简化，目前应在所有情况下自动生效。\n  - 许多过去可能导致未定义行为的情况现在都会被检查，并抛出带有详细信息的异常。\n  - 参数初始化 `model->initialize_params(...)` 和参数设置 `model->set_params(...)` 已经解耦。在使用模型之前，必须先调用 `set_params`。而调用 `initialize_params` 不再影响模型的实际参数，而是仅返回一组适合作为训练初始状态的参数。\n  - 快照现在在 `CutlassMLP` 和 `FullyFusedMLP` 之间，以及在 `float` 和 `__half` 精度之间均兼容。这意味着任何 GPU 生成的快照都可以被其他任何 GPU 加载。\n  - `GridEncoding` 的哈希函数现在可以进行配置。\n- **无数的错误修复与性能提升。**","2022-12-15T14:59:07",{"id":180,"version":181,"summary_zh":182,"released_at":183},189112,"v1.5","# 自上一版本以来的变更\r\n\r\n- 在 __tiny-cuda-nn__ 中，编码器和神经网络现在共享用于可微对象的通用 API。这大大简化了实现。\r\n  - 作为这一通用化的一部分，编码器和神经网络现在可以接受并输出行主序和列主序的矩阵（即 AoS 和 SoA 数据）。此外，输入数据还可以以任意步长进行访问，从而允许对输入矩阵进行切片而无需复制。\r\n- 添加了对 `GridEncoding` 的双反向传播支持，这对于例如 eikonal 监督等场景非常有用（感谢 @ventusff）。\r\n- 在示例应用中移除了对 PyEXR \u002F tinyexr 的依赖，改用 `imageio` \u002F `stb_image`。\r\n- 修复了 __大量__ Bug，增加了多项性能优化，并提升了与旧版 GPU 的兼容性。\r\n","2022-04-22T07:20:33",{"id":185,"version":186,"summary_zh":187,"released_at":188},189113,"v1.4","# 自上一版本以来的变更\n\n## 主要变更\n- __新增了一个 PyTorch 扩展，用于在 Python 中使用 tiny-cuda-nn。__\n  - 此功能目前处于“beta”阶段。如果您遇到任何问题，请务必报告！\n  - 请参阅 [README 
中的这一节](https:\u002F\u002Fgithub.com\u002Fnvlabs\u002Ftiny-cuda-nn#pytorch-extension)，以获取安装和使用说明。\n  - 注意：Python\u002FPyTorch 的开销可能较大。例如，打包的 mlp_learning_an_image 示例通过 PyTorch 运行时比原生 CUDA 慢约 2 倍。（尽管这仍然比完全用 Python 从头实现要快，但仍需注意这一点。）\n- __显著降低了内存占用（有时可降低至原来的三分之一）__\n  - 添加了一个 GPU 内存池，支持高效、按流顺序分配和释放临时缓冲区。这避免了预先分配内存的需求，从而通常使内存消耗降低至原来的__三分之一__。\n  - 该内存池利用 GPU 的虚拟内存映射机制来实现高性能，同时不会使指针失效或重新排列内存。\n- __tiny-cuda-nn__ 中的所有神经网络现在还额外支持_行主序_的输入内存布局。这在原本需要转置的情况下，能够带来更高的性能和更低的内存占用。\n  - `GridEncoding` 天然输出行主序数据，因此在其后接神经网络时，速度可提升约__20%__。\n- __tiny-cuda-nn__ 现在可以在计算能力低至 37 的旧版 GPU 上运行。\n\n## 次要变更\n- 将 `GridEncoding` 的输入梯度计算速度提升了约 3 倍。\n- 提升了 `SyncedMultiStream` 的速度。\n- 修复了 `SphericalHarmonicsEncoding` 的梯度计算错误。\n- 修复了在提供 `max_level` 参数或使用 `Interpolation::Nearest` 时，`GridEncoding` 的梯度计算错误。","2022-02-14T14:53:12",{"id":190,"version":191,"summary_zh":192,"released_at":193},189114,"v1.3","# 自上一版本以来的更改\r\n\r\n\u003Cimg src=\"https:\u002F\u002Fnvlabs.github.io\u002Finstant-ngp\u002Fassets_readme\u002Ffox.gif\" height=\"338\"\u002F> \u003Cimg src=\"https:\u002F\u002Fnvlabs.github.io\u002Finstant-ngp\u002Fassets_readme\u002Frobot5.gif\" height=\"338\"\u002F>\r\n\r\n## 主要更改\r\n- 新增一种编码：`GridEncoding`\r\n  - 该编码可用于[即时训练和渲染神经图形原语](https:\u002F\u002Fnvlabs.github.io\u002Finstant-ngp\u002F)（参见上方的实时 NeRF 飞行演示）\r\n  - 它基于可训练的多分辨率网格概念，这些网格可以由哈希表、密集存储或分块存储来支持。\r\n  - 更多细节请参阅[这篇技术论文](https:\u002F\u002Fnvlabs.github.io\u002Finstant-ngp\u002Fassets\u002Fmueller2022instant.pdf)。\r\n- __tiny-cuda-nn__ 现在可在 CUDA 10.2 上运行（此前要求 CUDA 11 或更高版本）\r\n- __tiny-cuda-nn__ 现在仅需 C++14 标准（此前为 C++17）\r\n\r\n\r\n## 次要更改\r\n- 本仓库现通过 GitHub Actions 支持持续集成构建。\r\n- 增加了对宽度为 16 个神经元的 `FullyFusedMLP` 的支持\r\n- 增加了对 `SyncedMultiStream` 嵌套的支持","2022-01-14T09:47:39",{"id":195,"version":196,"summary_zh":197,"released_at":198},189115,"v1.2","# 自上一版本以来的变更\r\n\r\n## 主要变更\r\n- 新增三种编码：(i) `TriangleWave`，(ii) `SphericalHarmonics`，(iii) `Composite`\r\n- 现在使用带音高指针来参数化所有编码的输入和输出。\r\n  - 此功能引入了一种新的 `Composite` 编码，可以将基础编码应用于输入维度的不同子集。\r\n  - 
同时也取消了“编码维度”与“直通维度”的区分。旧版中某些维度的直通行为可以通过与 `Identity` 编码组合来实现。\r\n- __tiny-cuda-nn__ 不再依赖 cuRAND，而是改用基于 PCG32 随机数生成器的实现（源自 https:\u002F\u002Fgithub.com\u002Fwjakob\u002Fpcg32）来处理所有随机性相关操作。\r\n- 激活函数的代码已在 CUTLASS 组件内部及组件之间实现集中化管理。所有神经网络实现现在都支持所有激活函数（除了 `ResNet` 模型，其隐藏层仍仅支持 `ReLU` 激活）。\r\n\r\n\r\n## 次要变更\r\n- CMake 现在能够正确自动检测并针对已安装的 GPU 进行编译。\r\n- 当 __tiny-cuda-nn__ 作为子模块使用时，现在可以禁用示例和基准测试。\r\n- 放宽了对 CUDA 版本的要求。未来计划支持 CUDA 10.2。\r\n\r\n","2021-12-15T14:43:45",{"id":200,"version":201,"summary_zh":202,"released_at":203},189116,"v1.1","# 自上一版本以来的变更\n\n## 主要变更\n- __tiny-cuda-nn__ 现在支持通过 `Trainer::serialize` 和 `Trainer::deserialize` 保存和加载模型快照。这些函数会生成一个 `nlohmann::json` 对象，其中包含训练好的模型参数，以及可选的优化器状态（以便继续训练）。\n\n将生成的 JSON 数据高效地存储到磁盘上的推荐方式如下：\n```c++\nstd::ofstream f(\"checkpoint.msgpack\", std::ios::out | std::ios::binary);\njson::to_msgpack(trainer->serialize(), f);\n```\n再次加载时可以使用以下代码：\n```c++\nstd::ifstream f{\"checkpoint.msgpack\", std::ios::in | std::ios::binary};\ntrainer->deserialize(json::from_msgpack(f));\n```\n\n- __tiny-cuda-nn__ 现在支持 L1 类型的损失函数。新增了四种损失：`L1`、`相对 L1`、`MAPE`（平均绝对百分比误差）和 `SMAPE`（对称平均绝对百分比误差）。\n- `GPUMatrix` 的使用变得更加简洁。列主序矩阵现在使用 `GPUMatrix\u003CT>` 类型，而行主序矩阵则使用 `GPUMatrix\u003CT, RM>`。此外，还引入了一种动态布局的矩阵类型：`GPUMatrixDynamic\u003CT>`。因此，动态布局网络输出的 API 现在也得到了简化。\n\n## 次要变更\n- 扩展了 `Network`\u002F`NetworkWithInputEncoding` 的功能，使其支持诸如提取神经元激活值或输出对输入的梯度等特性。\n- 向 `FullyFusedMLP` 中添加了 `Squareplus` 和 `Softplus` 激活函数。\n- CMake 现在能够自动检测系统的 GPU 架构，从而简化了针对 Turing 和 A100 GPU 的编译流程（请参阅更新后的 `README.md`）。\n- 从所有损失函数中移除了 `data_factor` 参数。若需实现相同的行为，请将现有损失函数封装在一个辅助类中。\n\n","2021-10-30T08:50:37"]