[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-pytorch--TensorRT":3,"tool-pytorch--TensorRT":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",160784,2,"2026-04-19T11:32:54",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",109154,"2026-04-18T11:18:24",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 
100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":72,"owner_avatar_url":73,"owner_bio":74,"owner_company":75,"owner_location":75,"owner_email":75,"owner_twitter":75,"owner_website":76,"owner_url":77,"languages":78,"stars":117,"forks":118,"last_commit_at":119,"license":120,"difficulty_score":10,"env_os":121,"env_gpu":122,"env_ram":123,"env_deps":124,"category_tags":131,"github_topics":132,"view_count":32,"oss_zip_url":75,"oss_zip_packed_at":75,"status":17,"created_at":140,"updated_at":141,"faqs":142,"releases":174},9749,"pytorch\u002FTensorRT","TensorRT","PyTorch\u002FTorchScript\u002FFX compiler for NVIDIA GPUs using TensorRT","Torch-TensorRT 是一款专为 NVIDIA GPU 设计的开源编译器，旨在将 TensorRT 的高性能推理能力无缝集成到 PyTorch 生态中。它主要解决了深度学习模型在部署阶段推理延迟高、资源消耗大的痛点，通过编译优化，能让 PyTorch 模型的推理速度相比原生执行提升高达 5 倍。\n\n这款工具非常适合需要在 NVIDIA 平台上部署高效 AI 应用的开发者与研究人员。无论是希望快速验证模型加速效果的研究员，还是追求极致性能的生产环境工程师，都能从中受益。Torch-TensorRT 的独特亮点在于其极简的使用方式：用户只需一行代码，利用熟悉的 `torch.compile` 接口并指定 backend 为\"tensorrt\"，即可自动完成复杂的图优化与算子融合。此外，它还支持“提前编译”工作流，允许将优化后的模型序列化为独立文件，不仅能在 Python 环境中直接加载，更能轻松部署于无 Python 依赖的 C++ 生产环境，极大地提升了落地的灵活性与效率。","\u003Cdiv align=\"center\">\n\nTorch-TensorRT\n===========================\n\u003Ch4> Easily achieve the best inference performance for any PyTorch model on the NVIDIA platform. 
\u003C\u002Fh4>\n\n[![Documentation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdocs-master-brightgreen)](https:\u002F\u002Fnvidia.github.io\u002FTorch-TensorRT\u002F)\n[![pytorch](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPyTorch-2.12-green)](https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fnightly\u002Fcu130)\n[![cuda](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCUDA-13.0-green)](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-downloads)\n[![trt](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTensorRT-10.14.1-green)](https:\u002F\u002Fgithub.com\u002Fnvidia\u002Ftensorrt)\n[![license](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-BSD--3--Clause-blue)](.\u002FLICENSE)\n[![Linux x86-64 Nightly Wheels](https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Factions\u002Fworkflows\u002Fbuild-test-linux-x86_64.yml\u002Fbadge.svg?branch=nightly)](https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Factions\u002Fworkflows\u002Fbuild-test-linux-x86_64.yml)\n[![Linux SBSA Nightly Wheels](https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Factions\u002Fworkflows\u002Fbuild-test-linux-aarch64.yml\u002Fbadge.svg?branch=nightly)](https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Factions\u002Fworkflows\u002Fbuild-test-linux-aarch64.yml)\n[![Windows Nightly Wheels](https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Factions\u002Fworkflows\u002Fbuild-test-windows.yml\u002Fbadge.svg?branch=nightly)](https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Factions\u002Fworkflows\u002Fbuild-test-windows.yml)\n\n---\n\u003Cdiv align=\"left\">\n\nTorch-TensorRT brings the power of TensorRT to PyTorch. Accelerate inference latency by up to 5x compared to eager execution in just one line of code.\n\u003C\u002Fdiv>\u003C\u002Fdiv>\n\n## Installation\nStable versions of Torch-TensorRT are published on PyPI\n```bash\npip install torch-tensorrt\n```\n\nNightly versions of Torch-TensorRT are published on the PyTorch package index\n```bash\npip install --pre torch-tensorrt --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fnightly\u002Fcu130\n```\n\nTorch-TensorRT is also distributed in the ready-to-run [NVIDIA NGC PyTorch Container](https:\u002F\u002Fcatalog.ngc.nvidia.com\u002Forgs\u002Fnvidia\u002Fcontainers\u002Fpytorch) which has all dependencies with the proper versions and example notebooks included.\n\nFor more advanced installation  methods, please see [here](https:\u002F\u002Fpytorch.org\u002FTensorRT\u002Fgetting_started\u002Finstallation.html)\n\n## Quickstart\n\n### Option 1: torch.compile\nYou can use Torch-TensorRT anywhere you use `torch.compile`:\n\n```python\nimport torch\nimport torch_tensorrt\n\nmodel = MyModel().eval().cuda() # define your model here\nx = torch.randn((1, 3, 224, 224)).cuda() # define what the inputs to the model will look like\n\noptimized_model = torch.compile(model, backend=\"tensorrt\")\noptimized_model(x) # compiled on first run\n\noptimized_model(x) # this will be fast!\n```\n\n### Option 2: Export\nIf you want to optimize your model ahead-of-time and\u002For deploy in a C++ environment, Torch-TensorRT provides an export-style workflow that serializes an optimized module. This module can be deployed in PyTorch or with libtorch (i.e. 
without a Python dependency).\n\n#### Step 1: Optimize + serialize\n```python\nimport torch\nimport torch_tensorrt\n\nmodel = MyModel().eval().cuda() # define your model here\ninputs = [torch.randn((1, 3, 224, 224)).cuda()] # define a list of representative inputs here\n\ntrt_gm = torch_tensorrt.compile(model, ir=\"dynamo\", inputs=inputs)\ntorch_tensorrt.save(trt_gm, \"trt.ep\", inputs=inputs) # PyTorch only supports Python runtime for an ExportedProgram. For C++ deployment, use a TorchScript file\ntorch_tensorrt.save(trt_gm, \"trt.ts\", output_format=\"torchscript\", inputs=inputs)\n```\n\n#### Step 2: Deploy\n##### Deployment in PyTorch:\n```python\nimport torch\nimport torch_tensorrt\n\ninputs = [torch.randn((1, 3, 224, 224)).cuda()] # your inputs go here\n\n# You can run this in a new python session!\nmodel = torch.export.load(\"trt.ep\").module()\n# model = torch_tensorrt.load(\"trt.ep\").module() # this also works\nmodel(*inputs)\n```\n\n##### Deployment in C++:\n```cpp\n#include \"torch\u002Fscript.h\"\n#include \"torch_tensorrt\u002Ftorch_tensorrt.h\"\n\nauto trt_mod = torch::jit::load(\"trt.ts\");\nauto input_tensor = [...]; \u002F\u002F fill this with your inputs\nauto results = trt_mod.forward({input_tensor});\n```\n\n## Further resources\n- [Double PyTorch Inference Speed for Diffusion Models Using Torch-TensorRT](https:\u002F\u002Fdeveloper.nvidia.com\u002Fblog\u002Fdouble-pytorch-inference-speed-for-diffusion-models-using-torch-tensorrt\u002F)\n- [Up to 50% faster Stable Diffusion inference with one line of code](https:\u002F\u002Fpytorch.org\u002FTensorRT\u002Ftutorials\u002F_rendered_examples\u002Fdynamo\u002Ftorch_compile_stable_diffusion.html#sphx-glr-tutorials-rendered-examples-dynamo-torch-compile-stable-diffusion-py)\n- [Optimize LLMs from Hugging Face with Torch-TensorRT](https:\u002F\u002Fdocs.pytorch.org\u002FTensorRT\u002Ftutorials\u002Fcompile_hf_models.html#compile-hf-models)\n- [Run your model in FP8 with Torch-TensorRT](https:\u002F\u002Fpytorch.org\u002FTensorRT\u002Ftutorials\u002F_rendered_examples\u002Fdynamo\u002Fvgg16_fp8_ptq.html)\n- [Accelerated Inference in PyTorch 2.X with Torch-TensorRT](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=eGDMJ3MY4zk&t=1s)\n- [Tools to resolve graph breaks and boost performance]() \\[coming soon\\]\n- [Tech Talk (GTC '23)](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcspring23-s51714\u002F)\n- [Documentation](https:\u002F\u002Fnvidia.github.io\u002FTorch-TensorRT\u002F)\n\n\n## Platform Support\n\n| Platform            | Support                                          |\n| ------------------- | ------------------------------------------------ |\n| Linux AMD64 \u002F GPU   | **Supported**                                    |\n| Linux SBSA \u002F GPU    | **Supported**                                    |\n| Windows \u002F GPU       | **Supported (Dynamo only)**                      |\n| Linux Jetson \u002F GPU | **Source Compilation Supported on JetPack-4.4+**  |\n| Linux Jetson \u002F DLA | **Source Compilation Supported on JetPack-4.4+**  |\n| Linux ppc64le \u002F GPU | Not supported                                    |\n\n> Note: Refer [NVIDIA L4T PyTorch NGC container](https:\u002F\u002Fngc.nvidia.com\u002Fcatalog\u002Fcontainers\u002Fnvidia:l4t-pytorch) for PyTorch libraries on JetPack.\n\n### Dependencies\n\nThese are the following dependencies used to verify the testcases. 
Torch-TensorRT can work with other versions, but the tests are not guaranteed to pass.\n\n- Bazel 8.1.1\n- Libtorch 2.12.0.dev (latest nightly)\n- CUDA 13.0 (CUDA 12.6 on Jetson)\n- TensorRT 10.15.1.29 (TensorRT 10.3 on Jetson)\n\n## Deprecation Policy\n\nDeprecation is used to inform developers that some APIs and tools are no longer recommended for use. Beginning with version 2.3, Torch-TensorRT has the following deprecation policy:\n\nDeprecation notices are communicated in the Release Notes. Deprecated API functions will have a statement in the source documenting when they were deprecated. Deprecated methods and classes will issue deprecation warnings at runtime, if they are used. Torch-TensorRT provides a 6-month migration period after the deprecation. APIs and tools continue to work during the migration period. After the migration period ends, APIs and tools are removed in a manner consistent with semantic versioning.\n\n## Contributing\n\nTake a look at the [CONTRIBUTING.md](CONTRIBUTING.md)\n\n\n## License\n\nThe Torch-TensorRT license can be found in the [LICENSE](.\u002FLICENSE) file. It is licensed with a BSD Style licence\n","\u003Cdiv align=\"center\">\n\nTorch-TensorRT\n===========================\n\u003Ch4> 轻松为 NVIDIA 平台上的任何 PyTorch 模型实现最佳推理性能。 \u003C\u002Fh4>\n\n[![文档](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdocs-master-brightgreen)](https:\u002F\u002Fnvidia.github.io\u002FTorch-TensorRT\u002F)\n[![pytorch](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPyTorch-2.12-green)](https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fnightly\u002Fcu130)\n[![cuda](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCUDA-13.0-green)](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-downloads)\n[![trt](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTensorRT-10.14.1-green)](https:\u002F\u002Fgithub.com\u002Fnvidia\u002Ftensorrt)\n[![license](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-BSD--3--Clause-blue)](.\u002FLICENSE)\n[![Linux x86-64 夜间轮子](https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Factions\u002Fworkflows\u002Fbuild-test-linux-x86_64.yml\u002Fbadge.svg?branch=nightly)](https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Factions\u002Fworkflows\u002Fbuild-test-linux-x86_64.yml)\n[![Linux SBSA 夜间轮子](https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Factions\u002Fworkflows\u002Fbuild-test-linux-aarch64.yml\u002Fbadge.svg?branch=nightly)](https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Factions\u002Fworkflows\u002Fbuild-test-linux-aarch64.yml)\n[![Windows 夜间轮子](https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Factions\u002Fworkflows\u002Fbuild-test-windows.yml\u002Fbadge.svg?branch=nightly)](https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Factions\u002Fworkflows\u002Fbuild-test-windows.yml)\n\n---\n\u003Cdiv align=\"left\">\n\nTorch-TensorRT 将 TensorRT 的强大功能引入 PyTorch。只需一行代码，即可将推理延迟比急切执行模式最高缩短 5 倍。\n\u003C\u002Fdiv>\u003C\u002Fdiv>\n\n## 安装\nTorch-TensorRT 的稳定版本已在 PyPI 上发布：\n```bash\npip install torch-tensorrt\n```\n\nTorch-TensorRT 的夜间版本已在 PyTorch 包索引上发布：\n```bash\npip install --pre torch-tensorrt --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fnightly\u002Fcu130\n```\n\nTorch-TensorRT 也包含在开箱即用的 [NVIDIA NGC PyTorch 容器](https:\u002F\u002Fcatalog.ngc.nvidia.com\u002Forgs\u002Fnvidia\u002Fcontainers\u002Fpytorch) 中，该容器预装了所有依赖项及其正确版本，并附带示例笔记本。\n\n如需更高级的安装方法，请参阅 [此处](https:\u002F\u002Fpytorch.org\u002FTensorRT\u002Fgetting_started\u002Finstallation.html)。\n\n## 快速入门\n\n### 方法 
1：torch.compile\n您可以在任何使用 `torch.compile` 的地方使用 Torch-TensorRT：\n\n```python\nimport torch\nimport torch_tensorrt\n\nmodel = MyModel().eval().cuda() # 在此处定义您的模型\nx = torch.randn((1, 3, 224, 224)).cuda() # 定义模型的输入形式\n\noptimized_model = torch.compile(model, backend=\"tensorrt\")\noptimized_model(x) # 第一次运行时编译\noptimized_model(x) # 此次运行将非常快速！\n```\n\n### 方法 2：导出\n如果您希望提前优化模型或在 C++ 环境中部署，Torch-TensorRT 提供了一种导出式工作流程，可将优化后的模块序列化。此模块可在 PyTorch 或 libtorch 中部署（即无需 Python 依赖）。\n\n#### 步骤 1：优化 + 序列化\n```python\nimport torch\nimport torch_tensorrt\n\nmodel = MyModel().eval().cuda() # 在此处定义您的模型\ninputs = [torch.randn((1, 3, 224, 224)).cuda()] # 在此处定义一组具有代表性的输入\n\ntrt_gm = torch_tensorrt.compile(model, ir=\"dynamo\", inputs=inputs)\ntorch_tensorrt.save(trt_gm, \"trt.ep\", inputs=inputs) # PyTorch 仅支持 Python 运行时的 ExportedProgram。若要在 C++ 中部署，请使用 TorchScript 文件\ntorch_tensorrt.save(trt_gm, \"trt.ts\", output_format=\"torchscript\", inputs=inputs)\n```\n\n#### 步骤 2：部署\n##### 在 PyTorch 中部署：\n```python\nimport torch\nimport torch_tensorrt\n\ninputs = [torch.randn((1, 3, 224, 224)).cuda()] # 在此处输入您的数据\n\n# 您可以在新的 Python 会话中运行以下代码！\nmodel = torch.export.load(\"trt.ep\").module()\n# model = torch_tensorrt.load(\"trt.ep\").module() # 这种方式同样可行\nmodel(*inputs)\n```\n\n##### 在 C++ 中部署：\n```cpp\n#include \"torch\u002Fscript.h\"\n#include \"torch_tensorrt\u002Ftorch_tensorrt.h\"\n\nauto trt_mod = torch::jit::load(\"trt.ts\");\nauto input_tensor = [...]; \u002F\u002F 用您的输入填充此处\nauto results = trt_mod.forward({input_tensor});\n```\n\n## 更多资源\n- [使用 Torch-TensorRT 将扩散模型的 PyTorch 推理速度提升一倍](https:\u002F\u002Fdeveloper.nvidia.com\u002Fblog\u002Fdouble-pytorch-inference-speed-for-diffusion-models-using-torch-tensorrt\u002F)\n- [只需一行代码，Stable Diffusion 推理速度提升高达 50%](https:\u002F\u002Fpytorch.org\u002FTensorRT\u002Ftutorials\u002F_rendered_examples\u002Fdynamo\u002Ftorch_compile_stable_diffusion.html#sphx-glr-tutorials-rendered-examples-dynamo-torch-compile-stable-diffusion-py)\n- [使用 Torch-TensorRT 优化 Hugging Face 的 LLM](https:\u002F\u002Fdocs.pytorch.org\u002FTensorRT\u002Ftutorials\u002Fcompile_hf_models.html#compile-hf-models)\n- [使用 Torch-TensorRT 以 FP8 格式运行您的模型](https:\u002F\u002Fpytorch.org\u002FTensorRT\u002Ftutorials\u002F_rendered_examples\u002Fdynamo\u002Fvgg16_fp8_ptq.html)\n- [借助 Torch-TensorRT，在 PyTorch 2.X 中加速推理](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=eGDMJ3MY4zk&t=1s)\n- [用于解决图中断并提升性能的工具]() \\[即将推出\\]\n- [技术讲座（GTC '23）](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcspring23-s51714\u002F)\n- [文档](https:\u002F\u002Fnvidia.github.io\u002FTorch-TensorRT\u002F)\n\n\n## 平台支持\n\n| 平台            | 支持                                          |\n| ------------------- | ------------------------------------------------ |\n| Linux AMD64 \u002F GPU   | **支持**                                    |\n| Linux SBSA \u002F GPU    | **支持**                                    |\n| Windows \u002F GPU       | **支持（仅限 Dynamo）**                      |\n| Linux Jetson \u002F GPU | **JetPack-4.4+ 上支持源码编译**  |\n| Linux Jetson \u002F DLA | **JetPack-4.4+ 上支持源码编译**  |\n| Linux ppc64le \u002F GPU | 不支持                                    |\n\n> 注：有关 JetPack 上的 PyTorch 库，请参考 [NVIDIA L4T PyTorch NGC 容器](https:\u002F\u002Fngc.nvidia.com\u002Fcatalog\u002Fcontainers\u002Fnvidia:l4t-pytorch)。\n\n### 依赖项\n\n以下是用于验证测试用例的依赖项。Torch-TensorRT 可与其他版本配合使用，但无法保证测试一定通过。\n\n- Bazel 8.1.1\n- Libtorch 2.12.0.dev（最新夜间版）\n- CUDA 13.0（Jetson 上为 CUDA 12.6）\n- TensorRT 10.15.1.29（Jetson 上为 TensorRT 10.3）\n\n## 
弃用政策\n\n弃用旨在告知开发者某些 API 和工具已不再推荐使用。自版本 2.3 起，Torch-TensorRT 实行如下弃用政策：\n\n弃用通知将在发行说明中公布。被弃用的 API 函数会在源代码中注明其弃用时间。如果在运行时使用已被弃用的方法或类，系统将发出弃用警告。Torch-TensorRT 会在弃用后提供 6 个月的过渡期。在此期间，API 和工具仍可正常使用。过渡期结束后，API 和工具将按照语义版本控制规则被移除。\n\n## 贡献\n请查看 [CONTRIBUTING.md](CONTRIBUTING.md)\n\n## 许可证\n\nTorch-TensorRT 的许可证可在 [LICENSE](.\u002FLICENSE) 文件中找到。它采用 BSD 风格许可证授权。","# Torch-TensorRT 快速上手指南\n\nTorch-TensorRT 是将 NVIDIA TensorRT 的强大功能引入 PyTorch 的工具。只需一行代码，即可在 NVIDIA 平台上将 PyTorch 模型的推理延迟降低高达 5 倍。\n\n## 环境准备\n\n在使用前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**:\n    *   Linux (AMD64\u002Fx86_64 或 SBSA\u002Faarch64) - **完全支持**\n    *   Windows - **仅支持 Dynamo 后端**\n    *   NVIDIA Jetson (JetPack 4.4+) - 支持源码编译\n*   **硬件**: NVIDIA GPU\n*   **核心依赖版本** (推荐测试通过的版本):\n    *   **PyTorch**: 2.12+ (Nightly 版本兼容性更佳)\n    *   **CUDA**: 13.0 (Jetson 平台为 12.6)\n    *   **TensorRT**: 10.14.1+ (Jetson 平台为 10.3+)\n    *   **Libtorch**: 2.12.0.dev+\n\n> **注意**: 对于 Jetson 用户，推荐使用 [NVIDIA L4T PyTorch NGC 容器](https:\u002F\u002Fngc.nvidia.com\u002Fcatalog\u002Fcontainers\u002Fnvidia:l4t-pytorch) 以获得预配置好的环境。\n\n## 安装步骤\n\n### 方式一：安装稳定版 (推荐)\n通过 PyPI 直接安装稳定版本：\n\n```bash\npip install torch-tensorrt\n```\n\n### 方式二：安装夜间构建版 (Nightly)\n如果您需要使用最新的特性或与最新版的 PyTorch\u002FCUDA 配合，可安装夜间构建版：\n\n```bash\npip install --pre torch-tensorrt --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fnightly\u002Fcu130\n```\n\n### 方式三：使用 NVIDIA NGC 容器 (最简便)\n无需手动配置依赖，直接拉取包含所有必要组件和示例的官方容器：\n\n```bash\ndocker pull nvcr.io\u002Fnvidia\u002Fpytorch:\u003Ctag>\n# 具体标签请参考 NVIDIA NGC Catalog\n```\n\n## 基本使用\n\nTorch-TensorRT 提供了两种主要的使用模式：**即时编译 (`torch.compile`)** 和 **导出部署 (Export)**。\n\n### 模式一：使用 `torch.compile` (最简单)\n这是最快的上手方式，适用于 Python 环境下的即时加速。只需指定 `backend=\"tensorrt\"`。\n\n```python\nimport torch\nimport torch_tensorrt\n\n# 1. 定义模型并移至 CUDA\nmodel = MyModel().eval().cuda() \n\n# 2. 定义代表性的输入张量 (用于形状推导)\nx = torch.randn((1, 3, 224, 224)).cuda() \n\n# 3. 使用 torch.compile 进行编译，指定 backend 为 tensorrt\noptimized_model = torch.compile(model, backend=\"tensorrt\")\n\n# 4. 首次运行触发编译\noptimized_model(x) \n\n# 5. 
后续运行将获得加速效果\noptimized_model(x) \n```\n\n### 模式二：导出与部署 (适用于生产环境)\n如果您需要预先优化模型，或将其部署到 C++ 环境（无 Python 依赖），请使用导出工作流。\n\n#### 第一步：优化并序列化模型\n\n```python\nimport torch\nimport torch_tensorrt\n\nmodel = MyModel().eval().cuda()\ninputs = [torch.randn((1, 3, 224, 224)).cuda()]\n\n# 编译模型\ntrt_gm = torch_tensorrt.compile(model, ir=\"dynamo\", inputs=inputs)\n\n# 保存为 .ep 格式 (仅限 Python 运行时加载)\ntorch_tensorrt.save(trt_gm, \"trt.ep\", inputs=inputs)\n\n# 保存为 TorchScript (.ts) 格式 (可用于 C++ 部署)\ntorch_tensorrt.save(trt_gm, \"trt.ts\", output_format=\"torchscript\", inputs=inputs)\n```\n\n#### 第二步：加载与推理\n\n**在 Python 中加载:**\n```python\nimport torch\nimport torch_tensorrt\n\ninputs = [torch.randn((1, 3, 224, 224)).cuda()]\n\n# 加载导出的程序\nmodel = torch.export.load(\"trt.ep\").module()\n# 或者使用: model = torch_tensorrt.load(\"trt.ep\").module()\n\n# 执行推理\nmodel(*inputs)\n```\n\n**在 C++ 中加载:**\n```cpp\n#include \"torch\u002Fscript.h\"\n#include \"torch_tensorrt\u002Ftorch_tensorrt.h\"\n\n\u002F\u002F 加载 TorchScript 模型\nauto trt_mod = torch::jit::load(\"trt.ts\");\nauto input_tensor = [...]; \u002F\u002F 填入您的输入数据\n\n\u002F\u002F 执行推理\nauto results = trt_mod.forward({input_tensor});\n```","某自动驾驶团队正在将基于 PyTorch 开发的实时道路目标检测模型部署到搭载 NVIDIA Orin 芯片的边缘计算盒上，以满足车辆行驶中的低延迟决策需求。\n\n### 没有 TensorRT 时\n- **推理延迟过高**：模型在 GPU 上以原生 eager 模式运行，单帧图像处理耗时超过 40ms，导致系统整体响应滞后，无法满足实时性要求。\n- **算力资源浪费**：由于缺乏底层算子融合与精度校准，GPU 利用率虽高但有效吞吐量低，难以在同一硬件上并发运行多个感知任务。\n- **部署流程繁琐**：为了追求性能，工程师需手动重写部分网络层或转换模型格式，不仅开发周期长，还容易引入兼容性问题。\n- **动态调整困难**：面对不同光照或天气场景，无法快速对模型进行即时优化，往往需要重新训练或离线转换，灵活性极差。\n\n### 使用 TensorRT 后\n- **延迟显著降低**：仅需一行代码调用 `torch.compile(backend=\"tensorrt\")`，利用算子融合与内核自动调优技术，将单帧推理时间压缩至 8ms 以内，提升近 5 倍速度。\n- **吞吐量大幅提升**：通过 INT8 量化与显存优化，同等硬件下每秒可处理帧数（FPS）成倍增长，轻松支持多路摄像头数据并行分析。\n- **部署极简高效**：支持直接序列化导出为 TorchScript 或 C++ 模块，无需修改原有 PyTorch 代码逻辑，即可在嵌入式环境中无缝运行。\n- **迭代敏捷灵活**：开发人员可在 Python 环境中快速验证不同输入形状下的优化效果，实现“编译即优化”，大幅缩短从算法研发到上车验证的周期。\n\nTensorRT 让复杂的底层 GPU 优化变得像调用普通函数一样简单，真正实现了高性能推理的“开箱即用”。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fpytorch_TensorRT_69c6b96a.png","pytorch","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fpytorch_be722ba8.jpg","",null,"https:\u002F\u002Fpytorch.org","https:\u002F\u002Fgithub.com\u002Fpytorch",[79,83,87,91,95,99,102,106,110,113],{"name":80,"color":81,"percentage":82},"Python","#3572A5",50,{"name":84,"color":85,"percentage":86},"Jupyter Notebook","#DA5B0B",26.4,{"name":88,"color":89,"percentage":90},"C++","#f34b7d",21.3,{"name":92,"color":93,"percentage":94},"Starlark","#76d275",1.2,{"name":96,"color":97,"percentage":98},"CMake","#DA3434",0.3,{"name":100,"color":101,"percentage":98},"Shell","#89e051",{"name":103,"color":104,"percentage":105},"C","#555555",0.2,{"name":107,"color":108,"percentage":109},"Go Template","#00ADD8",0.1,{"name":111,"color":112,"percentage":109},"Dockerfile","#384d54",{"name":114,"color":115,"percentage":116},"CSS","#663399",0,2966,392,"2026-04-19T07:11:16","BSD-3-Clause","Linux, Windows","必需 NVIDIA GPU。Linux (AMD64\u002FSBSA) 和 Windows 均受支持；Jetson 平台需 JetPack-4.4+。CUDA 版本要求：标准环境为 CUDA 13.0（测试验证版），Jetson 环境为 CUDA 12.6。","未说明",{"notes":125,"python":123,"dependencies":126},"1. Windows 平台仅支持 Dynamo 后端。2. Jetson 平台需通过源码编译安装，建议使用 NVIDIA L4T PyTorch NGC 容器。3. 可通过 NVIDIA NGC PyTorch 容器获取预装好所有依赖的环境。4. 
支持通过 torch.compile 一键加速或通过导出功能部署到 C++ 环境。",[127,128,129,130],"PyTorch 2.12.0.dev (nightly)","TensorRT 10.15.1.29 (Jetson 上为 10.3)","CUDA 13.0 (Jetson 上为 12.6)","Bazel 8.1.1",[14],[133,134,135,72,136,137,138,139],"tensorrt","libtorch","machine-learning","deep-learning","jetson","nvidia","cuda","2026-03-27T02:49:30.150509","2026-04-20T04:05:12.175194",[143,148,153,158,162,166,170],{"id":144,"question_zh":145,"answer_zh":146,"source_url":147},43782,"如何在 Windows 上安装 torch-tensorrt？","Windows 现在已作为一级支持平台。早期版本中通过 pip 安装或从源码构建可能会遇到错误（如 Bazel 下载失败），但在新版本中这些问题已得到解决。请确保使用最新版本的 torch-tensorrt，并尝试直接使用 `pip install torch-tensorrt` 进行安装。如果仍然遇到问题，请检查是否使用了兼容的 PyTorch 和 CUDA 版本。","https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fissues\u002F2577",{"id":149,"question_zh":150,"answer_zh":151,"source_url":152},43783,"编译时出现 'Expected ivalues_maps.count(input) to be true but got false' 错误怎么办？","该错误通常与模型未处于评估模式（eval mode）有关。解决方法是在编译前调用 `model.eval()`。示例代码如下：\n\n```python\nmodel.load_state_dict(torch.load(path_trained_pth, map_location=\"cuda:0\"))\nmodel.eval()  # 关键步骤\nsample_input = torch.randn((2, 4, 384, 384)).cuda().float()\ntraced_model = torch.jit.trace(model, sample_input)\ntrt_ts_module = trt.compile(traced_model, inputs=[sample_input], enabled_precisions={torch.float32})\n```\n此外，升级到 torch-tensorrt 1.3.0 或更高版本也可能解决此问题。","https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fissues\u002F922",{"id":154,"question_zh":155,"answer_zh":156,"source_url":157},43784,"如何在 Jetson 5.0 (L4T R35.x) 上安装 torch-tensorrt？","目前可能没有预编译的 Python wheel 包直接支持 Jetson 5.0。您可以尝试以下两种方法：\n1. **使用提供的 wheel 文件**：下载社区提供的 wheel 文件（如 torch_tensorrt-1.4.0+...linux_aarch64.zip），解压后使用 `pip install file.whl --no-deps` 安装（需先手动安装 torch）。\n2. **在 L4T Docker 容器中编译**：使用 NVIDIA 提供的 L4T Docker 容器（如 l4t-r35.3.1），按照官方文档在容器内从源码编译。注意：如果使用 PyTorch 版本包含 '.nv' 后缀，可能需要应用补丁修复版本解析问题（参考 Issue #2118）。","https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fissues\u002F1600",{"id":159,"question_zh":160,"answer_zh":161,"source_url":157},43785,"在 Jetson 上编译时遇到 'cannot find -ltorch' 链接错误如何解决？","该错误表明编译器找不到 PyTorch 库。这通常是因为构建环境中的 PyTorch 库架构不匹配或未正确链接。建议在使用 Bazel 构建时，确保使用的是与目标平台（aarch64）兼容的 PyTorch 版本。最可靠的方法是在官方的 L4T Docker 容器中进行编译，因为容器中已经预装了正确版本的依赖库。避免在主机环境中直接交叉编译，除非您非常清楚如何配置工具链。",{"id":163,"question_zh":164,"answer_zh":165,"source_url":152},43786,"Stable Diffusion 的 autoencoder.encode 编译时报错怎么办？","如果在编译 Stable Diffusion 的 autoencoder.encode 部分时遇到类似 'Expected ivalues_maps.count(input) to be true' 的错误，这通常是一个已知问题。解决方法是将 torch-tensorrt 升级到 1.3.0 或更高版本，该版本修复了相关的分区和分析逻辑错误。",{"id":167,"question_zh":168,"answer_zh":169,"source_url":152},43787,"如何将 torchvision 中的 FCOS (with ResNet50) 模型转换为 TensorRT？","转换 torchvision 模型（如 FCOS）时遇到的编译错误通常与模型结构或算子支持有关。首先确保调用了 `model.eval()`。如果问题依然存在，这可能是由于特定的 TorchScript 图结构导致的，建议查看相关的 torchvision issue (#6200) 以获取特定模型的变通方案。通常情况下，确保模型已被正确追踪（trace）或脚本化（script），并且输入形状固定，有助于成功编译。",{"id":171,"question_zh":172,"answer_zh":173,"source_url":157},43788,"在 Jetson 上编译时遇到 PyTorch 版本解析错误（含 .nv 后缀）如何修复？","在 Jetson 平台上，PyTorch 版本号可能包含 '.nv' 后缀（例如 '2.0.0+nvidia'），这会导致 torch-tensorrt 的版本检查逻辑失败。您需要修改 `py\u002Ftorch_tensorrt\u002F__init__.py` 文件，添加一个函数来清理版本号。具体补丁如下：\n\n```python\ndef sanitized_torch_version() -> Any:\n    return (\n        torch.__version__\n        if \".nv\" not in torch.__version__\n        else torch.__version__.split(\".nv\")[0]\n    )\n\n# 将原来的版本比较改为使用清洗后的版本\nif version.parse(sanitized_torch_version()) >= version.parse(\"2.1.dev\"):\n    # ... 
后续代码\n```\n应用此补丁后重新编译即可。",[175,180,185,190,195,200,205,210,215,220,225,230,235,240,245,250,255,260,265,270],{"id":176,"version":177,"summary_zh":178,"released_at":179},351201,"v2.11.0","## Torch-TensorRT 2.11.0 Linux x86-64 和 Windows 目标平台 \nPyTorch 2.11、CUDA 12.6\u002F12.8\u002F12.9\u002F13.0、TensorRT 10.15、Python 3.10~3.13\n\n### Torch-TensorRT 的预编译轮子包现已提供：\n**x86-64 Linux 和 Windows：**\n- CUDA 13.0 + Python 3.10–3.13 可通过 PyPI 获取\n  - https:\u002F\u002Fpypi.org\u002Fproject\u002Ftorch-tensorrt\u002F\n\n- CUDA 12.6\u002F12.8\u002F12.9\u002F13.0 + Python 3.10–3.13 也可通过 PyTorch 索引获取\n  - https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Ftorch-tensorrt\n\n**aarch64 SBSA Linux 和 Jetson Thor：**\n- CUDA 13.0 + Python 3.10–3.13 + PyTorch 2.11 + TensorRT 10.15\n  - 可通过 PyPI 获取：https:\u002F\u002Fpypi.org\u002Fproject\u002Ftorch-tensorrt\u002F\n  - 也可通过 PyTorch 索引获取：https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Ftorch-tensorrt\n\n**Jetson Orin：**\n- 尚未发布适用于 Jetson Orin 的 torch_tensorrt 2.9\u002F2.10\u002F2.11 版本\n- 请继续使用 torch_tensorrt 2.8 版本\n\n\n## Torch-TensorRT-RTX 2.11.0 Linux x86-64 和 Windows 目标平台 \nPyTorch 2.11、CUDA 12.9\u002F13.0、TensorRT-RTX 1.3、Python 3.10~3.13\n\n### Torch-TensorRT-RTX 的预编译轮子包现已提供：\n**x86-64 Linux 和 Windows：**\n- CUDA 13.0 + Python 3.10–3.13 可通过 PyPI 获取\n  - https:\u002F\u002Fpypi.org\u002Fproject\u002Ftorch-tensorrt-rtx\u002F\n\n- CUDA 12.9\u002F13.0 + Python 3.10–3.13 也可通过 PyTorch 索引获取\n  - https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Ftorch-tensorrt-rtx\n\n注意：tensorrt-rtx 1.3 的轮子包目前尚未在 PyPI 上发布，因此请从 https:\u002F\u002Fdeveloper.nvidia.com\u002Ftensorrt-rtx 下载 tarball，并从中安装轮子包。\n\n### IAttention 层\n\n在本版本中，TensorRT 的原生 IAttention 层默认用于处理各种与注意力相关的 ATen 操作，包括 SDPA、Flash-SDPA、Efficient-SDPA 和 cuDNN-SDPA。这种集成能够实现更高效的执行，并提升模型性能。要显式启用此行为，可在 `compile()` 函数中设置 `decompose_attention=False`。启用后，将使用 TensorRT 的原生实现来优化注意力计算。然而，由于当前 TensorRT 的限制，某些操作如 `compute_log_sumexp` 和 `分组查询注意力 (GQA)` 尚不支持。如果遇到这些情况，编译时会显示一条提示信息。此外，您也可以将 `decompose_attention` 设置为 `True`，以将注意力操作分解为多个基础的 ATen 操作。尽管这种方法可能无法达到相同的性能优化效果，但它能够覆盖更广泛的算子，并在不同模型架构之间提供更高的兼容性。\n\n### 符号形状系统的改进\n\n针对用于跟踪图体中动态维度变化的符号形状系统，我们进行了两项关键改进：\n\n1. 
每个已编译的引擎都会将形状传播公式作为元数据记录下来。\n\n现在，我们无需实例化引擎即可完成序列化、重新追踪等需要假张量传播的关键任务；相反，在编译时，我们会为每个 TRT 子图记录输入和输出之间的形状关系，并……","2026-04-07T17:03:57",{"id":181,"version":182,"summary_zh":183,"released_at":184},351202,"v2.10.0","## Torch-TensorRT 2.10.0 Linux x86-64 和 Windows 目标平台 \nPyTorch 2.10、CUDA 12.9、13.0、TensorRT 10.14、Python 3.10~3.13\n\n### Torch-TensorRT 的预编译二进制包现已提供：\n**x86-64 Linux 和 Windows：**\n- CUDA 13.0 + Python 3.10–3.13 可通过 PyPI 获取\n  - https:\u002F\u002Fpypi.org\u002Fproject\u002Ftorch-tensorrt\u002F\n\n- CUDA 12.9\u002F13.0 + Python 3.10–3.13 也可通过 PyTorch 索引获取\n  - https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Ftorch-tensorrt\n\n**aarch64 SBSA Linux 和 Jetson Thor**\n- CUDA 13.0 + Python 3.10–3.13 + PyTorch 2.10 + TensorRT 10.14\n  - 可通过 PyPI 获取：https:\u002F\u002Fpypi.org\u002Fproject\u002Ftorch-tensorrt\u002F\n  - 可通过 PyTorch 索引获取：https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Ftorch-tensorrt\n\n**Jetson Orin**\n- 尚未发布适用于 Jetson Orin 的 torch_tensorrt 2.9\u002F2.10 版本\n- 请继续使用 torch_tensorrt 2.8 版本\n\n## 重要变更\n\n默认情况下，使用 `torch_tensorrt.save` 保存编译后的图模块时，会启用重追踪功能。Torch-TensorRT 会使用 `torch.export.export(strict=False)` 重新导出图以进行保存，从而保留输出 FX 图的完整性并填充元数据。\n\n## 新特性\n\n### LLM 改进\n`run_llm` 脚本现支持编译此前已使用 TensorRT 模型优化工具包量化并上传至 HuggingFace 的模型。\n\n目前我们支持以下推理场景：\n\n#### 标准高精度模型，直接通过 torch_tensorrt Autocast 以 fp16\u002Fbf16 编译并运行推理\n```python\npython run_llm.py --model Qwen\u002FQwen2.5-0.5B-Instruct \\\n--prompt \"什么是并行编程？\" \\\n--model_precision FP16 --num_tokens 128 \\\n--cache static_v2 --enable_pytorch_run\n```\n\n#### 标准高精度模型，使用 TensorRT 模型优化器在设备上量化并编译，然后以 fp8\u002Fnvfp4 精度运行推理\n```python\npython run_llm.py --model Qwen\u002FQwen2.5-0.5B-Instruct  \\\n--prompt \"什么是并行编程？\"  \\\n--model_precision FP16 \\\n--quant_format fp8 --num_tokens 128 \\\n--cache static_v2 --enable_pytorch_run\n```\n\n#### 已经量化并上传至 HuggingFace 的模型，直接以 fp8\u002Fnvfp4 进行编译和推理\n```python\npython run_llm.py --model nvidia\u002FQwen3-8B-FP8 \\\n--prompt \"什么是并行编程？\"  \\\n--model_precision FP16 \\\n--quant_format fp8 \\\n--num_tokens 128 \\\n--cache static_v2 --enable_pytorch_run\n```\n\n**注意事项：**\n- `--model_precision`\n  - 此参数为必填项，用于告知 LLM 工具模型的精度。\n- `--quant_format`\n  - 此参数为可选项，仅用于量化模型的推理。\n  - 对于预先量化的 modelopt 检查点，此参数用于指示……\n\n### 引擎缓存改进\n在此版本之前，由于 TensorRT（\u003C10.14）的限制，去权重引擎只能重新适配一次，因此我们曾缓存带权重的引擎以确保引擎缓存功能正常工作，但这占用了不必要的硬盘空间。自本版本起，如果用户安装了 TensorRT ≥10.14，则引擎缓存将仅在磁盘上保存去权重的引擎。","2026-02-20T02:13:45",{"id":186,"version":187,"summary_zh":188,"released_at":189},351203,"v2.9.0","## PyTorch 2.9、CUDA 13.0 TensorRT 10.13、Python 3.13\n\nTorch-TensorRT 2.9.0 的 Linux x86-64 和 Windows 版本分别针对 PyTorch 2.9、TensorRT 10.13、CUDA 13.0、12.8、12.6 以及 Python 3.10 至 3.13。\n\n### Python\n\n#### x86-64 Linux 和 Windows\n\n- 通过 PyPI 提供 CUDA 13.0 + Python 3.10–3.13 的版本\n  - https:\u002F\u002Fpypi.org\u002Fproject\u002Ftorch-tensorrt\u002F\n\n- 通过 PyTorch 索引也提供 CUDA 12.6\u002F12.8\u002F13.0 + Python 3.10–3.13 的版本\n  - https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Ftorch-tensorrt\n\n#### aarch64 SBSA Linux 和 Jetson Thor\n\n- CUDA 13.0 + Python 3.10–3.13 + PyTorch 2.9 + TensorRT 10.13（其中 Python 3.12 是唯一经过 Thor 平台验证的版本）\n  - 可通过 PyPI 获取：https:\u002F\u002Fpypi.org\u002Fproject\u002Ftorch-tensorrt\u002F\n  - 也可通过 PyTorch 索引获取：https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Ftorch-tensorrt\n\n> 注意：在 aarch64 平台上，必须显式安装 TensorRT，或使用系统已安装的 TensorRT 轮子包。\n\n```sh\nuv pip install torch torch-tensorrt tensorrt \n```\n\n#### aarch64 Jetson Orin\n\n- 目前尚无适用于 Jetson Orin 的 torch_tensorrt 2.9 版本，请继续使用 torch_tensorrt 2.8 版本。\n\n### C++\n\n#### 
x86-64 Linux 和 Windows\n\n- CUDA 13.0 的 Tarball \u002F Zip 包\n\n### 已弃用功能\n\n#### FX 前端\n\nFX 前端是 Dynamo 前端的前身，两者共享许多组件。如今，Dynamo 前端已经稳定，并且所有共享组件均已解耦。因此，自 2026 年上半年起，我们将不再在二进制发行版中包含 FX 前端。不过，FX 前端仍保留在源代码树中，以便在必要时可通过源码构建重新安装该前端。\n\n## 新特性\n\n### LLM 和 VLM 改进\n\n在本次发布中，我们引入了多项关键改进：\n- SDPA 转换器中的滑动窗口注意力：新增对滑动窗口注意力的支持，成功编译了 Gemma3 模型（Gemma3-1B）。\n- 动态自定义降级传递：重构了降级框架，允许用户根据 Hugging Face 模型的配置动态注册自定义传递。\n- 视觉-语言模型（VLM）支持：\n  - 通过新的 run_vlm.py 工具，新增对 Eagle2 和 Qwen2.5-VL 模型的支持。\n  - run_vlm.py 可以编译 VLM 模型的视觉和语言组件，并支持 KV 缓存，从而实现高效的 VLM 生成。\n\n有关运行这些模型的详细说明，请参阅[文档](https:\u002F\u002Fdocs.pytorch.org\u002FTensorRT\u002Ftutorials\u002Fcompile_hf_models.html)。\n\n### TensorRT-RTX\n\nTensorRT-RTX 是一种以 JIT 优先的 TensorRT 版本。与传统 TensorRT 在构建阶段进行策略选择和融合不同，TensorRT-RTX 允许在针对特定硬件进行优化之前分发构建结果，从而可以将一个与 GPU 无关的软件包分发给所有用户。随后，在首次使用时，TensorRT RTX 会根据用户所使用的具体硬件进行调优。Torch-TensorRT-RTX 是 Torch-Tenso 的一个构建版本。","2025-10-17T15:58:11",{"id":191,"version":192,"summary_zh":193,"released_at":194},351204,"v2.8.0","PyTorch 2.8、CUDA 12.8 TensorRT 10.12、Python 3.13\n\nTorch-TensorRT 2.8.0 标准版适用于 Linux x86-64 和 Windows 平台，支持 PyTorch 2.8、TensorRT 10.12、CUDA 12.6、12.8、12.9 以及 Python 3.9 至 3.13。\n\n- Linux x86-64 + Windows\n   - CUDA 12.8 + Python 3.9–3.13 可通过 PyPI 获取：https:\u002F\u002Fpypi.org\u002Fproject\u002Ftorch-tensorrt\u002F\n   - CUDA 12.6\u002F12.8\u002F12.9 + Python 3.9–3.13 也可通过 PyTorch 索引获取：https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Ftorch-tensorrt\n\n#### 平台支持\n\n除了标准的 Windows x86-64 和 Linux x86-64 版本外，我们现在还为 SBSA 和 Jetson 提供二进制构建：\n\n- SBSA aarch64\n   - CUDA 12.9 + Python 3.9–3.13 + Torch 2.8 + TensorRT 10.12\n   - 可通过 PyPI 获取：https:\u002F\u002Fpypi.org\u002Fproject\u002Ftorch-tensorrt\u002F\n   - 也可通过 PyTorch 索引获取：https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Ftorch-tensorrt\n\n- Jetson Orin\n   - CUDA 12.6 + Python 3.10 + Torch 2.8 + TensorRT 10.3.0\n   - 可在 https:\u002F\u002Fpypi.jetson-ai-lab.io\u002Fjp6\u002Fcu126 获取\n\n#### 已弃用功能\n\n- TensorRT 的隐式量化支持自 TensorRT 10.1 起已被弃用。与 INT8Calibrator 相关的 Torch-TensorRT API 将在 Torch-TensorRT 2.9.0 中移除。建议量化用户迁移到基于 TensorRT 模型优化工具包的工作流。更多信息请参见：https:\u002F\u002Fdocs.pytorch.org\u002FTensorRT\u002Ftutorials\u002F_rendered_examples\u002Fdynamo\u002Fvgg16_ptq.html\n\n## 新特性\n\n### AOT-Inductor 无 Python 部署\n**稳定性：测试版**\n\n传统上，TorchScript 一直被用于在非 Python 解释器环境中运行 Torch-TensorRT 程序。无论是 `dynamo`\u002F`torch.compile` 前端，还是 TorchScript 前端，都支持这种基于 TorchScript 的部署方式。\n\n##### 旧方法\n\n```py\ntrt_model = torch_tensorrt.compile(model, ir=\"dynamo\", arg_inputs=[...])\nts_model = torch.jit.trace(trt_model, inputs=[...])\nts_model.save(\"trt_model.ts\")\n```\n\n现在，您可以通过 AOT-Inductor 实现类似的效果。AOTInductor 是 [TorchInductor](https:\u002F\u002Fdev-discuss.pytorch.org\u002Ft\u002Ftorchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes\u002F747) 的一个专用版本，专为处理导出的 PyTorch 模型而设计，能够对其进行优化，并生成共享库及其他相关工件。这些编译后的工件特别适合在非 Python 环境中部署。\n\nTorch-TensorRT 可以将 TensorRT 引擎嵌入到 AOT-Inductor 库中，从而进一步加速模型。此外，您还可以通过这种方式将 Inductor 内核与 TensorRT 引擎结合使用。这使得用户能够利用 torch-compile 的原生技术，在非 Python 环境中部署其模型。\n\n##### 新方法\n\n```py\nwith torch.no_grad():\n    cg_trt_module = torch_tensorrt.compile(model, **compile_settings)\n    torch_tensorrt.save(\n        cg_trt_module,\n        file_path=os.path.join(os.getcwd(), \"model.pt2\"),\n        output_format=\"aot_inductor\",\n        retrace=True,\n        arg_inputs=example_inputs,\n    )\n```\n\n随后，这个 `model.pt2` 文件可以在 Python 环境中加载。","2025-08-09T06:02:38",{"id":196,"version":197,"summary_zh":198,"released_at":199},351205,"v2.6.1","## 变更内容\n* 
由 @lanluo-nvidia 在 https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F3540 中移除断点\n* 由 @lanluo-nvidia 在 https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F3542 中修复 patch2.6.1 的构建问题\n* 由 @lanluo-nvidia 在 https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F3545 中将版本更新至 2.6.1\n* 由 @lanluo-nvidia 在 https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F3547 中将 3505（Windows 驱动升级）合入 release2.6.1\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fcompare\u002Fv2.6.0...v2.6.1","2025-06-03T04:49:20",{"id":201,"version":202,"summary_zh":203,"released_at":204},351206,"v2.7.0","## PyTorch 2.7、CUDA 12.8、TensorRT 10.9、Python 3.13\n\nTorch-TensorRT 2.7.0 针对 PyTorch 2.7、TensorRT 10.9 和 CUDA 12.8 构建，（适用于 CUDA 11.8\u002F12.4 的构建可通过 PyTorch 软件包索引获取——https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118 和 https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu124）。支持 Python 3.9 至 3.13 版本。我们不再提供基于旧 cxx11 ABI 的构建，所有 wheel 和 tarball 文件都将使用 cxx11 ABI。\n\n#### 已知问题\n\n- 在 Python 3.13 中，引擎重适配功能已被禁用。\n\n### 使用自动插件生成在 TensorRT 引擎中调用自定义内核\n\n用户可以使用 OpenAI Triton 等 DSL 开发自己的自定义内核。通过 PyTorch 自定义操作和 Torch-TensorRT 自动插件生成功能，这些内核可以在 TensorRT 引擎中被调用，且只需极少的额外代码。\n\n```py\n@triton.jit\ndef elementwise_scale_mul_kernel(X, Y, Z, a, b, BLOCK_SIZE: tl.constexpr):\n    pid = tl.program_id(0)\n    # 计算该线程块将处理的元素范围\n    block_start = pid * BLOCK_SIZE\n    # 当前线程将处理的索引范围\n    offsets = block_start + tl.arange(0, BLOCK_SIZE)\n    # 从 X 和 Y 张量中加载元素\n    x_vals = tl.load(X + offsets)\n    y_vals = tl.load(Y + offsets)\n    # 执行逐元素乘法\n    z_vals = x_vals * y_vals * a + b\n    # 将结果存储到 Z 中\n    tl.store(Z + offsets, z_vals)\n\n\n@torch.library.custom_op(\"torchtrt_ex::elementwise_scale_mul\", mutates_args=())  # type: ignore[misc]\ndef elementwise_scale_mul(\n    X: torch.Tensor, Y: torch.Tensor, b: float = 0.2, a: int = 2\n) -> torch.Tensor:\n    # 确保张量位于 GPU 上\n    assert X.is_cuda and Y.is_cuda, \"张量必须位于 CUDA 设备上。\"\n    assert X.shape == Y.shape, \"张量必须具有相同的形状。\"\n\n    # 创建输出张量\n    Z = torch.empty_like(X)\n\n    # 定义块大小\n    BLOCK_SIZE = 1024\n\n    # 程序网格\n    grid = lambda meta: (X.numel() \u002F\u002F meta[\"BLOCK_SIZE\"],)\n\n    # 使用参数 a 和 b 启动内核\n    elementwise_scale_mul_kernel[grid](X, Y, Z, a, b, BLOCK_SIZE=BLOCK_SIZE)\n\n    return Z\n\n@torch.library.register_fake(\"torchtrt_ex::elementwise_scale_mul\")\ndef _(x: torch.Tensor, y: torch.Tensor, b: float = 0.2, a: int = 2) -> torch.Tensor:\n    return x\n\ntorch_tensorrt.dynamo.conversion.plugins.custom_op(\"torchtrt_ex::elementwise_scale_mul\", supports_dynamic_shapes=True, requires_output_allocator=False)\n\ntrt_mod_w_kernel = torch_tensorrt.compile(module, ...) 
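\n\n# 示意用法（假设 module 的 forward 接受两个形状相同、位于 CUDA 上的张量 X 和 Y）：\n# 编译后的模块在执行时会通过自动生成的 TensorRT 插件调用上面的 Triton 内核\n# Z = trt_mod_w_kernel(X, Y)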
\n```\n\n`torch_tensorrt.dynamo.conversion.plugins.custom_op` 将使用 [Quick Deploy 插件系统](https:\u002F\u002Fdocs.nvidia.com\u002Fdeeplearning\u002Ftensorrt\u002Flatest\u002F_static\u002Fpython-api\u002Finfer\u002Ftensorrt.plugin\u002Ftrt_plugin_register.html#tensorrt.plugin.register) 并借助 [PyTorch 的 FakeTensor 模式](https:\u002F\u002Fdocs.pytorch.org\u002Fdocs\u002Fstable\u002Ftorch.compiler_fake_tensor.html)，通过复用注册所需的信息来生成一个 TensorRT 插件。","2025-05-07T23:44:29",{"id":206,"version":207,"summary_zh":208,"released_at":209},351207,"v2.6.0","## PyTorch 2.6、CUDA 12.6 TensorRT 10.7、Python 3.12\n\nTorch-TensorRT 2.6.0 针对 PyTorch 2.6、TensorRT 10.7 和 CUDA 12.6 进行优化。（支持 CUDA 11.8\u002F12.4 的构建可通过 PyTorch 软件包索引获取——https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118 和 https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu124）。支持 Python 3.9 至 3.12 版本。由于当前 TensorRT 尚不支持 Python 3.13，因此本版本不提供对该版本的支持。\n\n### 弃用通知\n\n在 v2.6 中，torchscript 前端将被弃用。具体而言，以下用法将不再受支持，并且在运行时使用时会发出弃用警告：\n\n```py\ntorch_tensorrt.compile(model, ir=\"torchscript\")\n```\n\n今后，我们鼓励用户迁移到以下受支持的选项之一：\n\n```py\ntorch_tensorrt.compile(model)\ntorch_tensorrt.compile(model, ir=\"dynamo\")\ntorch.compile(model, backend=\"tensorrt\")\n```\n\ntorchscript 仍将继续作为部署格式受到支持，方法是通过编译后的追踪实现：\n\n```py\ndynamo_model = torch_tensorrt.compile(model, ir=\"dynamo\", arg_inputs=[...])\nts_model = torch.jit.trace(dynamo_model, inputs=[...])\nts_model(...)\n```\n\n有关弃用政策的更多信息，请参阅 README 文件。\n\n### 跨操作系统编译\n\n在 Torch-TensorRT 2.6 中，现在可以使用 Linux 主机通过 `torch_tensorrt.cross_compile_for_windows` API 为 Windows 编译 Torch-TensorRT 程序。这些程序采用略有不同的序列化格式以支持此工作流程，因此无法在 Linux 上运行。因此，在调用 `torch_tensorrt.cross_compile_for_windows` 时，程序将直接保存到磁盘。开发人员随后应在 Windows 目标平台上使用 `torch_tensorrt.load_cross_compiled_exported_program` 来加载序列化的程序。Torch-TensorRT 程序现在包含目标平台信息，以便在反序列化时验证操作系统兼容性。这也导致运行时 ABI 发生了变化。\n\n```py\nif load:\n    # 在 Windows 上加载已保存的模型\n    if platform.system() != \"Windows\" or platform.machine() != \"AMD64\":\n        raise ValueError(\n            \"跨运行时编译的 Windows 模型只能在 Windows 系统上加载\"\n        )\n    loaded_model = torchtrt.load_cross_compiled_exported_program(save_path).module()\n    print(f\"模型已成功从 ${save_path} 加载\")\n    # 推理\n    trt_output = loaded_model(input)\n    print(f\"推理结果：{trt_output}\")\nelse:\n    if platform.system() != \"Linux\" or platform.architecture()[0] != \"64bit\":\n        raise ValueError(\n            \"跨运行时编译的 Windows 模型只能在 Linux 系统上编译\"\n        )\n    compile_spec = {\n        \"debug\": True,\n        \"min_block_size\": 1,\n    }\n    torchtrt.cross_compile_for_windows(\n        model, file_path=save_path, inputs=inputs, **compile_spec\n    )\n    print(\n        f\"模型已成功跨运行时编译并保","2025-02-05T22:03:37",{"id":211,"version":212,"summary_zh":213,"released_at":214},351208,"v2.5.0","## PyTorch 2.5、CUDA 12.4、TensorRT 10.3、Python 3.12\n\nTorch-TensorRT 2.5.0 针对 PyTorch 2.5、TensorRT 10.3 和 CUDA 12.4。\n（针对 CUDA 11.8\u002F12.1 的构建可通过 PyTorch 软件包索引获取 - https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118 https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu121）\n\n### 弃用通知\n\ntorchscript 前端将在 v2.6 中被弃用。具体而言，以下用法将不再受支持，并且在运行时使用时会发出弃用警告：\n\n```py\ntorch_tensorrt.compile(model, ir=\"torchscript\")\n```\n\n今后，我们鼓励用户切换到以下受支持的选项之一：\n\n```py\ntorch_tensorrt.compile(model)\ntorch_tensorrt.compile(model, ir=\"dynamo\")\ntorch.compile(model, backend=\"tensorrt\")\n```\n\ntorchscript 将继续作为部署格式受到支持，方法是通过编译后的追踪：\n\n```py\ndynamo_model = torch_tensorrt.compile(model, ir=\"dynamo\", arg_inputs=[...])\nts_model = torch.jit.trace(dynamo_model, 
inputs=[...])\nts_model(...)\n```\n\n有关我们的弃用政策的更多信息，请参阅 README 文件。\n\n### 重调优（Beta 版）\n\nv2.5.0 引入了针对已编译 Torch-TensorRT 程序的直接模型重调优功能。有时，在推理过程中需要更改权重，而过去必须通过完全重新编译来替换模型权重，这可以通过 `torch.compile` 的自动重新编译或通过 `torch_tensorrt.compile` 的手动重新编译来实现。现在，借助 `refit_module_weights` API，只需提供一个包含新权重且结构相同的全新 PyTorch 模块，即可对已编译模块进行重调优。要使用此功能，已编译模块必须在编译时启用 `make_refittable` 选项。\n\n```py\n# 创建并导出更新后的模型\nmodel2 = models.resnet18(pretrained=True).eval().to(\"cuda\")\nexp_program2 = torch.export.export(model2, tuple(inputs))\n\n\ncompiled_trt_ep = torch_trt.load(\".\u002Fcompiled.ep\")\n\n# 这将返回一个带有更新权重的新模块\nnew_trt_gm = refit_module_weights(\n    compiled_module=compiled_trt_ep,\n    new_weight_module=exp_program2,\n)\n```\n\n有一些操作与重调优不兼容，例如使用 `ILoop layer` 的操作。当启用 `make_refittable` 时，这些操作将被迫在 PyTorch 中运行。此外，需要注意的是，启用了重调优功能的引擎性能可能会略低于未启用重调优的引擎，因为 TensorRT 无法针对执行时将看到的具体权重进行优化。\n\n#### 重调优缓存（实验性）\n\n单独使用重调优功能可以将模型更新和切换时间加快 0.5 到 2 倍。然而，通过利用重调优缓存，还可以进一步提升重调优的速度。在编译时启用的重调优缓存会将 PyTorch 模块成员与 TRT 层名称之间的直接映射提示存储在 `TorchTensorRTModule` 的元数据中。这种缓存可以使重调优速度提升几个数量级。不过，目前在处理某些层时仍存在一些限制。","2024-10-18T19:45:24",{"id":216,"version":217,"summary_zh":218,"released_at":219},351209,"v2.4.0","## Windows 平台上的 C++ 运行时支持、转换器中增强的动态形状支持、PyTorch 2.4、CUDA 12.4、TensorRT 10.1、Python 3.12\n\nTorch-TensorRT 2.4.0 针对 PyTorch 2.4、CUDA 12.4（针对 CUDA 11.8\u002F12.1 的构建可通过 PyTorch 软件包索引获取——https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118 和 https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu121）以及 TensorRT 10.1。  \n此版本正式引入了对 Windows 平台上 C++ 运行时的支持，不过目前仅限于 Dynamo 前端，同时支持 AOT 和 JIT 工作流。用户现在可以在 Windows 上同时使用 Python 和 C++ 运行时。此外，该版本还扩展了对所有 Aten 核心算子的支持，仅除 `torch.nonzero` 外，并显著提升了更多转换器对动态形状的支持能力。本版本首次支持 Python 3.12。\n\n### 完整的 Windows 支持\n在本次发布中，我们同时引入了 Windows 平台上的 C++ 和 Python 运行时支持。用户现在可以直接在 Windows 上使用 TensorRT 优化 PyTorch 模型，无需任何代码修改。C++ 运行时为默认选项，用户可以通过指定 `use_python_runtime=True` 来启用 Python 运行时。\n\n```py\nimport torch\nimport torch_tensorrt\nimport torchvision.models as models\n\nmodel = models.resnet18(pretrained=True).eval().to(\"cuda\")\ninput = torch.randn((1, 3, 224, 224)).to(\"cuda\")\ntrt_mod = torch_tensorrt.compile(model, ir=\"dynamo\", inputs=[input])\ntrt_mod(input)\n```\n\n### 转换器中增强的算子支持\n\n目前，转换器对核心 Aten 算子的支持率已接近 100%。此时若发生回退到 PyTorch 执行的情况，通常是由转换器本身的特定限制，或用户编译器设置（如 `torch_executed_ops`、动态形状等）共同导致的。本版本还进一步扩展了支持动态形状的算子数量。“dryrun” 将为您提供关于您的模型和设置的具体支持信息。\n\n\n## 变更内容\n* 修复：@gs-olive 在 https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F2669 中修复了 `get_attr` 调用中出现 FakeTensor 的问题。\n* 新特性：@zewenli98 在 https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F2614 中实现了对 `adaptive_avg_pool1d` 的 Dynamo 转换器支持。\n* 修复：@Arktische 在 https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F2672 中为 core_lowering.passes 添加了缺失的 CMake 源文件引用。\n* CI：@gs-olive 在 https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F2704 中将 Torch nightly 版本升级至 `2.4.0`。\n* 新增：@HolyWu 在 https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F2696 中为 `aten.pixel_unshuffle` 的 Dynamo 转换器添加了支持。\n* 新特性：@chohk88 在 https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F2689 中实现了对 `aten.atan2` 的转换器支持。\n* 新特性：@chohk88 在 https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F2710 中实现了对 `aten.index_select` 的转换器支持。\n* 新特性：@chohk88 在 https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F2711 中实现了对 `aten.isnan` 的转换器支持。\n* 新特性：@zewenli98 在 https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F2632 中实现了对 2D 和 3D 自适应平均池化操作的 Dynamo 转换器支持。\n* 
新特性：@chohk88 在 https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTenso","2024-07-29T21:50:55",{"id":221,"version":222,"summary_zh":223,"released_at":224},351210,"v2.3.0","## Dynamo 中的 Windows 支持、动态形状与量化，PyTorch 2.3、CUDA 12.1、TensorRT 10.0\n\nTorch-TensorRT 2.3.0 针对 PyTorch 2.3、CUDA 12.1（可通过 PyTorch 软件包索引获取 CUDA 11.8 的构建版本——https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118）以及 TensorRT 10.0。2.3.0 版本新增了对 Windows 平台的官方支持。在 Windows 上，仅支持使用 Dynamo 前端，且目前用户必须使用纯 Python 运行时（对 C++ 运行时的支持将在后续版本中添加）。此外，该版本还增加了无需重新编译即可支持动态形状的功能。现在，用户还可以通过 Model Optimizer 工具包（https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-Model-Optimizer）将量化模型与 Torch-TensorRT 结合使用。\n\n> 注意：Python 3.12 不受支持，因为 PyTorch 2.3.0 中的 Dynamo 堆栈不兼容 Python 3.12。\n\n### Windows\n在此版本中，我们引入了使用 Dynamo 路径的 Python 运行时对 Windows 的支持。用户现在可以在 Windows 上直接使用 TensorRT 优化 PyTorch 模型，且只需进行极少量的代码修改。这一集成使得在 Torch-TensorRT 的 Dynamo 编译路径中（`ir=\"dynamo\"` 和 `ir=\"torch_compile\"`）实现纯 Python 优化成为可能。\n\n```python\nimport torch\nimport torch_tensorrt\nimport torchvision.models as models\n\nmodel = models.resnet18(pretrained=True).eval().to(\"cuda\")\ninput = torch.randn((1, 3, 224, 224)).to(\"cuda\")\ntrt_mod = torch_tensorrt.compile(model, ir=\"dynamo\", inputs=[input])\ntrt_mod(input)\n```\n\n### Dynamo 中的动态形状模型编译\n在 v2.3.0 版本中，动态形状支持变得更加稳健。Torch-TensorRT 现在会利用图中的符号信息来计算中间形状范围，从而支持更多动态形状场景。对于使用 `torch.export` 的 AOT 工作流，无需任何更改即可利用这些新特性。而对于之前依赖 `torch.compile` 的守卫机制，在输入尺寸变化时自动重新编译引擎的 JIT 工作流，用户现在可以使用 PyTorch API 标记动态维度（https:\u002F\u002Fpytorch.org\u002Fdocs\u002Fstable\u002Ftorch.compiler_dynamic_shapes.html）。使用这些 API 后，只要输入不违反指定的约束条件，引擎就不会重新编译。\n\n##### AOT 工作流\n```python\nimport torch\nimport torch_tensorrt \ncompile_spec = {\"inputs\": [torch_tensorrt.Input(min_shape=(1, 3, 224, 224), \n                              opt_shape=(4, 3, 224, 224), \n                              max_shape=(8, 3, 224, 224), \n                              dtype=torch.float32)], \n                              \"enabled_precisions\": {torch.float}}\ntrt_model = torch_tensorrt.compile(model, **compile_spec)\n```\n\n##### JIT 工作流\n```python\nimport torch\nimport torch_tensorrt\ncompile_spec = {\"enabled_precisions\": {torch.float}}\ninputs = torch.randn((4, 3, 224, 224)).to(\"cuda\")\n# 这表示第 0 维是动态的，其范围为 [1, 8]\ntorch._dynamo.mark_dynamic(inputs, 0, min=1, max=8)\ntrt_model = torch.compile(model, ba","2024-06-07T21:49:43",{"id":226,"version":227,"summary_zh":228,"released_at":229},351211,"v2.2.0","## Dynamo Frontend for Torch-TensorRT, PyTorch 2.2, CUDA 12.1, TensorRT 8.6\r\n\r\nTorch-TensorRT 2.2.0 targets PyTorch 2.2, CUDA 12.1 (builds for CUDA 11.8 are available via the PyTorch package index - https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118) and TensorRT 8.6. This release is the second major release of Torch-TensorRT as the default frontend has changed from TorchScript to Dynamo allowing for users to more easily control and customize the compiler in Python. \r\n\r\nThe dynamo frontend can support both JIT workflows through `torch.compile` and AOT workflows through `torch.export + torch_tensorrt.compile`. It targets the __Core ATen Opset__ (https:\u002F\u002Fpytorch.org\u002Fdocs\u002Fstable\u002Ftorch.compiler_ir.html#core-aten-ir) and currently has 82% coverage. Just like in Torchscript graphs will be partitioned based on the ability to map operators to TensorRT in addition to any graph surgery done in Dynamo.  \r\n\r\n#### Output Format\r\nThrough the Dynamo frontend, different output formats can be selected for AOT workflows via the `output_format` kwarg. 
The choices are `torchscript` where the resulting compiled module will be traced with **`torch.jit.trace`**, suitable for Pythonless deployments, `exported_program` a new serializable format for PyTorch models or finally if you would like to run further graph transformations on the resultant model, `graph_module` will return a `torch.fx.GraphModule`.\r\n\r\n### Multi-GPU Safety \r\n\r\nTo address a long standing source of overhead, single GPU systems will now operate without typical required device checks. This check can be re-added when multiple GPUs are available to the host process using `torch_tensorrt.runtime.set_multi_device_safe_mode`\r\n\r\n```py\r\n# Enables Multi Device Safe Mode\r\ntorch_tensorrt.runtime.set_multi_device_safe_mode(True)\r\n\r\n# Disables Multi Device Safe Mode [Default Behavior]\r\ntorch_tensorrt.runtime.set_multi_device_safe_mode(False)\r\n\r\n# Enables Multi Device Safe Mode, then resets the safe mode to its prior setting\r\nwith torch_tensorrt.runtime.set_multi_device_safe_mode(True):\r\n    ...\r\n```\r\n\r\nMore information can be found here: https:\u002F\u002Fpytorch.org\u002FTensorRT\u002Fuser_guide\u002Fruntime.html\r\n\r\n### Capability Validators\r\n\r\nIn the Dynamo frontend, tests can be written and associated with converters to dynamically enable or disable them based on conditions in the target graph. \r\n\r\nFor example, the convolution converter in dynamo only supports 1D, 2D, and 3D convolution. We can therefore create a lambda which given a convolution FX node can determine if the convolution is supported: \r\n\r\n```py \r\n@dynamo_tensorrt_converter(\r\n    torch.ops.aten.convolution.default, \r\n     capability_validator=lambda conv_node: conv_node.args[7] in ([0], [0, 0], [0, 0, 0])\r\n)  # type: ignore[misc]\r\ndef aten_ops_convolution(\r\n    ctx: ConversionContext,\r\n    target: Target,\r\n    args: Tuple[Argument, ...],\r\n    kwargs: Dict[str, Argument],\r\n    name: str,\r\n) -> Union[TRTTensor, Sequence[TRTTensor]]:\r\n```\r\nIn such a case where the `Node` is not supported, the node will be partitioned out and run in PyTorch. \r\nAll capability validators are run prior to partitioning, after the lowering phase. \r\n\r\nMore information on writing converters for the Dynamo frontend can be found here: https:\u002F\u002Fpytorch.org\u002FTensorRT\u002Fcontributors\u002Fdynamo_converters.html\r\n\r\n### Breaking Changes\r\n\r\n- Dynamo (torch.export) is now the default frontend for Torch-TensorRT. The TorchScript and FX frontends are now in maintenance mode. Therefore any `torch.nn.Module`s or `torch.fx.GraphModule`s provided to `torch_tensorrt.compile` will by default be exported using `torch.export` then compiled. This default can be overridden by setting the `ir=[torchscript|fx]` kwarg. Any bugs reported will first be attempted to be resolved in the dynamo stack before attempting other frontends however pull requests for additional functionally in the TorchScript and FX frontends from the community will still be accepted.  
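\r\n\r\nAs a rough sketch of the output formats described above (the model and inputs are placeholders, and passing `output_format` directly through `torch_tensorrt.compile` is an assumption of this example rather than something these notes spell out):\r\n\r\n```py\r\nimport torch\r\nimport torch_tensorrt\r\n\r\nmodel = MyModel().eval().cuda()  # placeholder module\r\ninputs = [torch.randn((1, 3, 224, 224)).cuda()]\r\n\r\n# A serializable ExportedProgram\r\nep = torch_tensorrt.compile(model, ir=\"dynamo\", inputs=inputs, output_format=\"exported_program\")\r\n\r\n# TorchScript (traced with torch.jit.trace), suitable for Pythonless deployment\r\nts = torch_tensorrt.compile(model, ir=\"dynamo\", inputs=inputs, output_format=\"torchscript\")\r\n\r\n# A torch.fx.GraphModule, for running further graph transformations\r\ngm = torch_tensorrt.compile(model, ir=\"dynamo\", inputs=inputs, output_format=\"graph_module\")\r\n```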
\r\n\r\n## What's Changed\r\n* chore: Update Torch and Torch-TRT versions and docs on `main` by @gs-olive in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1784\r\n* fix: Repair invalid schema arising from lowering pass by @gs-olive in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1786\r\n* fix: Allow full model compilation with collection inputs (`input_signature`) by @gs-olive in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1656\r\n* feat(\u002F\u002Fcore\u002Fconversion): Add support for aten::size with dynamic shaped models for Torchscript backend. by @peri044 in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1647\r\n* feat: add support for aten::baddbmm by @mfeliz-cruise in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1806\r\n* [feat] Add dynamic conversion path to aten::mul evaluator by @mfeliz-cruise in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1710\r\n* [fix] aten::stack with dynamic inputs by @mfeliz-cruise in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1804\r\n* fix undefined attr issue by @bowang007 in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1783\r\n* fix: Out-Of-Bounds bug in Unsqueeze by @gs-olive in https:","2024-02-14T01:49:54",{"id":231,"version":232,"summary_zh":233,"released_at":234},351212,"v1.4.0","## PyTorch 2.0, CUDA 11.8, TensorRT 8.6, Support for the new `torch.compile` API, compatibility mode for FX frontend\r\n\r\nTorch-TensorRT 1.4.0 targets PyTorch 2.0, CUDA 11.8, TensorRT 8.5. This release introduces a number of beta features to set the stage for working with PyTorch and TensorRT in the 2.0 ecosystem. Primarily, this includes a new `torch.compile` backend targeting Torch-TensorRT. It also adds a compatibility layer that allows users of the TorchScript frontend for Torch-TensorRT to seamlessly try FX and Dynamo. \r\n\r\n### torch.compile` Backend for Torch-TensorRT\r\n\r\nOne of the most prominent new features in PyTorch 2.0 is the `torch.compile` workflow, which enables users to accelerate code easily by specifying a backend of their choice. Torch-TensorRT 1.4.0 introduces a new backend for `torch.compile` as a beta feature, including a convenience frontend to perform accelerated inference. This frontend can be accessed in one of two ways:\r\n​\r\n```python\r\nimport torch_tensorrt\r\ntorch_tensorrt.dynamo.compile(model, inputs, ...)\r\n​\r\n##### OR #####\r\n​\r\ntorch_tensorrt.compile(model, ir=\"dynamo_compile\", inputs=inputs, ...)\r\n```\r\nFor more examples, see the provided sample scripts, which can be [found here](https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Ftree\u002Fmain\u002Fexamples\u002Fdynamo)\r\n​\r\nThis compilation method has a couple key considerations:\r\n1. It can handle models with data-dependent control flow\r\n2. It automatically falls back to Torch if the TRT Engine Build fails for any reason\r\n3. It uses the Torch FX `aten` library of converters to accelerate models\r\n4. Recompilation can be caused by changing the batch size of the input, or providing an input which enters a new control flow branch\r\n5. 
Compiled models cannot be saved across Python sessions (yet)\r\n​\r\nThe feature is currently in beta, and we expect updates, changes, and improvements to the above in the future.\r\n\r\n### `fx_ts_compat` Frontend\r\n\r\nAs the ecosystem transitions from TorchScript to Dynamo, users of Torch-TensorRT may want start to experiment with this stack. As such we have introduced a new frontend for Torch-TensorRT which exposes the same APIs as the TorchScript frontend but will use the FX\u002FDynamo compiler stack. You can try this frontend by using the `ir=\"fx_ts_compat\"` setting \r\n\r\n```py\r\ntorch_tensorrt.compile(..., ir=\"fx_ts_compat\")\r\n```\r\n\r\n## What's Changed\r\n* Fix build by @yinghai in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1479\r\n* add circle CI signal in README page by @yinghai in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1481\r\n* fix eisum signature by @yinghai in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1480\r\n* Fix link to CircleCI in README.md by @yinghai in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1483\r\n* Minor changes by @yinghai in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1482\r\n* [FX] Changes done internally at Facebook by @frank-wei in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1456\r\n* chore: upload docs for 1.3.0 by @narendasan in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1504\r\n* fix: Repair Citrinet-1024 compilation issues by @gs-olive in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1488\r\n* refactor: Split elementwise tests by @peri044 in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1507\r\n* [feat] Support 1D topk by @mfeliz-cruise in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1491\r\n* Support aten::sum with bool tensor input by @mfeliz-cruise in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1512\r\n* [fix]Disambiguate cast layer names by @mfeliz-cruise in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1513\r\n* feat: Add functionality for easily benchmarking fx code on key models by @gs-olive in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1506\r\n* [feat]Canonicalize aten::multiply to aten::mul by @mfeliz-cruise in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1517\r\n* broadcast the two input shapes for transposed matmul by @nvpohanh in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1457\r\n* make padding layer converter more efficient by @nvpohanh in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1470\r\n* fix: Change equals-check from reference to value for BERT model not compiling in FX by @gs-olive in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1539\r\n* Update README dependencies section for v1.3.0 by @take-cheeze in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1540\r\n* fix: `aten::where` with differing-shape inputs bugfix by @gs-olive in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1533\r\n* fix: Automatically send truncated long ints to cuda at shape analysis time by @gs-olive in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1541\r\n* feat: Add functionality to FX benchmarking + Improve documentation by 
@gs-olive in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1529\r\n* [fix] Fix crash when calling unbind on evaluated tensor by @mfeliz-cruise in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1554\r\n* Update test_flatten_aten and test_reshape_aten due to PT2.0 changed tracer behavior for these ops by @frank-wei in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1559\r\n* fix: Bugfix for `align_corners=False`- FX interpolate by @gs-olive i","2023-06-03T04:05:21",{"id":236,"version":237,"summary_zh":238,"released_at":239},351213,"v1.3.0","## PyTorch 1.13, CUDA 11.7, TensorRT 8.5, Support for Dynamic Batch for Partially Compiled Modules, Engine Profiling, Experimental Unified Runtime for FX and TorchScript Frontends\r\n\r\nTorch-TensorRT 1.3.0 targets PyTorch 1.13, CUDA 11.7, cuDNN 8.5 and TensorRT 8.5. This release focuses on adding support for Dynamic Batch Sizes for partially compiled modules using the TorchScript frontend (this is also supported with the FX frontend). It also introduces a new execution profiling utility to understand the execution of specific engine sub blocks that can be used in conjunction with PyTorch profiling tools to understand the performance of your model post compilation. Finally this release introduces a new experimental unified runtime shared by both the TorchScript and FX frontends. This allows you to start using the FX frontend to generate `torch.jit.trace`able compiled modules. \r\n\r\n### Dynamic Batch Sizes for Partially Compiled Modules via the TorchScript Frontend\r\n\r\nA long-standing limitation of the partitioning system in the TorchScript function is lack of support for dynamic shapes. In this release we address a major subset of these use cases with support for dynamic batch sizes for modules that will be partially compiled. Usage is the same as the fully compiled workflow where using the `torch_tensorrt.Input` class, you may define the range of shapes that an input may take during runtime. This is represented as a set of 3 shape sizes: `min`, `max` and `opt`. `min` and `max` define the dynamic range of the input Tensor. `opt` informs TensorRT what size to optimize for provided there are multiple valid kernels available. TensorRT will select kernels that are valid for the full range of input shapes but most efficient at the `opt` size. In this release, partially compiled module inputs can vary in shape for the highest order dimension. \r\n\r\nFor example:\r\n\r\n```\r\nmin_shape: (1, 3, 128, 128)\r\nopt_shape: (8, 3, 128, 128)\r\nmax_shape: (32, 3, 128, 128)\r\n``` \r\n\r\nIs a valid shape range, however:\r\n\r\n```\r\nmin_shape: (1, 3, 128, 128)\r\nopt_shape: (1, 3, 256, 256)\r\nmax_shape: (1, 3, 512, 512)\r\n``` \r\n\r\nis still not supported.\r\n\r\n### Engine Profiling [Experimental]\r\n\r\nThis release introduces a number of profiling tools to measure the performance of TensorRT sub blocks in compiled modules. This can be used in conjunction with PyTorch profiling tools to get a picture of the performance of your model. Profiling for any particular sub block can be enabled by the `enabled_profiling()` method of any `__torch__.classes.tensorrt.Engine` attribute, or of any `torch_tensorrt.TRTModuleNext`. 
The profiler will dump trace files by default in `\u002Ftmp`, though this path can be customized by either setting the `profile_path_prefix` of `__torch__.classes.tensorrt.Engine` or as an argument to `torch_tensorrt.TRTModuleNext.enable_precision(profiling_results_dir=\"\")`.  Traces can be visualized using the Perfetto tool (https:\u002F\u002Fperfetto.dev)\r\n\r\n![Screenshot 2022-11-21 at 6 23 01 PM](https:\u002F\u002Fuser-images.githubusercontent.com\u002F1790613\u002F203202119-72020c1e-20e3-4c53-bf1c-d18fc97af926.png)\r\n\r\nEngine Layer information can also be accessed using `get_layer_info` which returns a JSON string with the layers \u002F fusions that the engine contains.\r\n\r\n### Unified Runtime for FX and TorchScript Frontends [Experimental]\r\n\r\nIn previous versions of Torch-TensorRT, the FX and TorchScript frontends were mostly separate and each had their distinct benefits and limitations. Torch-TensorRT 1.3.0 introduces a new unified runtime to support both FX and TorchScript meaning that you can choose the compilation workflow that makes the most sense for your particular use case, be it pure Python conversion via FX or C++ Torchscript compilation. Both frontends use the same primitives to construct their compiled graphs be it fully compiled or just partially. \r\n\r\n#### Basic Usage\r\n\r\nThe TorchScript frontend uses the new runtime by default. No additional workflow changes are necessary. \r\n\r\n> Note: The runtime ABI version was increased  to support this feature, as such models compiled with previous versions of Torch-TensorRT will need to be recompiled\r\n\r\nFor the FX frontend, the new runtime can be chosen but setting `use_experimental_fx_rt=True` as part of your compile settings to either `torch_tensorrt.compile(my_mod, ir=\"fx\", use_experimental_fx_rt=True, explicit_batch_dimension=True)` or `torch_tensorrt.fx.compile(my_mod, use_experimental_fx_rt=True, explicit_batch_dimension=True)`\r\n\r\n> Note: The new runtime only supports explicit batch dimension\r\n\r\n#### TRTModuleNext\r\n\r\nThe FX frontend will return a `torch.nn.Module` containing `torch_tensorrt.TRTModuleNext` submodules instead of `torch_tensorrt.fx.TRTModule`s. The features of these modules are nearly identical but with a few key improvements. \r\n\r\n1.  `TRTModuleNext` profiling dumps a trace visualizable with Perfetto (see above for more details). \r\n2. `TRTModuleNext` modules are `torch.jit.trace`-able, meaning you can save FX compiled m","2022-12-01T02:36:49",{"id":241,"version":242,"summary_zh":243,"released_at":244},351214,"v1.2.0","## PyTorch 1.12, Collections based I\u002FO, FX Frontend, torchtrtc custom op support, CMake build system and Community Window Support\r\n\r\nTorch-TensorRT 1.2.0 targets PyTorch 1.12, CUDA 11.6, cuDNN 8.4 and TensorRT 8.4. This release focuses on a couple key new APIs to handle function I\u002FO that uses collection types which should enable whole new model classes to be compiled by Torch-TensorRT without source code modification. It also introduces the \"FX Frontend\", a new frontend for Torch-TensorRT which leverages FX, a high level IR built into PyTorch with extensive Python APIs. For uses cases which do not need to be run outside of Python this may be a strong option to try as it is easily extensible in a familar development enviornment. In Torch-TensorRT 1.2.0, the FX frontend should be considered beta level in stability. `torchtrtc` has received improvements which target the ability to handle operators outside of the core PyTorch op set. 
This includes custom operators from libraries such as `torchvision` and `torchtext`. Similarly, users can provide custom converters to torchtrtc to extend the compiler's support from the command line instead of having to write an application to do so. Finally, Torch-TensorRT introduces community-supported Windows builds and a CMake build system. \r\n\r\n### New Dependencies \r\n\r\n#### `nvidia-tensorrt`\r\n\r\nFor previous versions of Torch-TensorRT, users had to install TensorRT via the system package manager and modify their `LD_LIBRARY_PATH` in order to set up Torch-TensorRT. Now users should install the TensorRT Python API as part of the installation procedure. This can be done via the following steps: \r\n\r\n```sh\r\npip install nvidia-pyindex\r\npip install nvidia-tensorrt==8.4.3.1\r\npip install torch-tensorrt==1.2.0 -f https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftensorrt\u002Freleases\r\n```\r\nInstalling the TensorRT pip package will allow Torch-TensorRT to automatically load the TensorRT libraries without any modification to environment variables. It is also a necessary dependency for the FX Frontend.\r\n\r\n#### `torchvision`\r\n\r\nSome FX frontend converters are designed to target operators from third-party libraries like torchvision. As such, you must have torchvision installed in order to use them. However, this dependency is optional for cases where you do not need this support. \r\n\r\n### Jetson\r\n\r\nStarting from this release, we will be distributing precompiled binaries of our NGC release branches for aarch64 (as well as x86_64), starting with ngc\u002F22.11. These releases are designed to be paired with NVIDIA distributed builds of PyTorch, including the NGC containers and Jetson builds, and are equivalent to the prepackaged distribution of Torch-TensorRT that comes in the containers. They represent the state of the master branch at the time of branch cutting, so they may lag in features by a month or so. These releases will come separately from minor version releases like this one. Therefore, going forward, these NGC releases should be the primary release channel used on Jetson (including for building from source). \r\n\r\n**NOTE:** NGC PyTorch builds are not identical to builds you might install through normal channels like pytorch.org. In the past this has caused issues in portability between pytorch.org builds and NGC builds. Therefore, in workflows such as exporting a TorchScript module on an x86 machine and then compiling on Jetson, we strongly recommend using the NGC container release on x86 for your host machine operations. More information about Jetson support can be found alongside the 22.07 release (https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Freleases\u002Ftag\u002Fv1.2.0a0.nv22.07)\r\n\r\n### Collections based I\u002FO [Experimental]\r\n\r\nTorch-TensorRT has previously operated under the assumption that `nn.Module` forward functions can trivially be reduced to the form `forward([Tensor]) -> [Tensor]`. Typically this implies functions of the form `forward(Tensor, Tensor, ... Tensor) -> (Tensor, Tensor, ..., Tensor)`. However, as model complexity increases, grouping inputs may make it easier to manage many inputs. Therefore, function signatures similar to `forward([Tensor], (Tensor, Tensor)) -> [Tensor]` or `forward((Tensor, Tensor)) -> (Tensor, (Tensor, Tensor))` might be more common. In Torch-TensorRT 1.2.0, more of these kinds of use cases are supported using the new experimental `input_signature` compile spec API. 
This API allows users to group Input specs similar to how they might group the input Tensors they would use to call the original module's forward function. This informs Torch-TensorRT on how to map a Tensor input from its location in a group to the engine and from the engine into its grouping returned back to the user.   \r\n\r\nTo make this concrete consider the following standard case: \r\n\r\n```py\r\nclass StandardTensorInput(nn.Module):\r\n    def __init__(self):\r\n        super(StandardTensorInput, self).__init__()\r\n\r\n    def forward(self, x, y):\r\n        r = x + y\r\n        return r\r\n\r\nx = torch.Tensor([1,2,3]).to(\"cuda\")\r\ny = torch.Tensor([4,5,6]).to(\"cud","2022-09-14T03:48:28",{"id":246,"version":247,"summary_zh":248,"released_at":249},351215,"v1.1.1","## Adding support for Torch-TensorRT on Jetpack 5.0 Developer Preview\r\n\r\nTorch-TensorRT 1.1.1 is a patch release for Torch-TensorRT 1.1 that targets PyTorch 1.11, CUDA 11.4\u002F11.3, TensorRT 8.4 EA\u002F8.2 and cuDNN 8.3\u002F8.2 intended to add support for Torch-TensorRT on Jetson \u002F Jetpack 5.0 DP. As this release is primarily targeted at adding support for Jetpack 5.0DP for the 1.1 feature set we will not be distributing pre-compiled binaries for this release so as not to break compatibility with the current stack for existing users who install directly from GitHub. Please follow the instructions for installation on Jetson in the documentation to install this release: https:\u002F\u002Fpytorch.org\u002FTensorRT\u002Ftutorials\u002Finstallation.html#compiling-from-source\r\n\r\n### Known Limitations\r\n\r\n- We have observed in testing, higher than normal numerical instability on Jetpack 5.0 DP. These issues are not observed on x86_64 based platforms. This numerical instability has not been found to decrease model accuracy in our test suite.\r\n\r\n## What's Changed\r\n* feat: Upgrade TensorRT to 8.4 EA by @peri044 in https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fpull\u002F1158\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fpytorch\u002FTensorRT\u002Fcompare\u002Fv1.1.0...v1.1.1\r\n\r\nOperators Supported\r\n=================================\r\n\r\n\r\nOperators Currently Supported Through Converters\r\n-------------------------------------------------\r\n\r\n- aten::_convolution(Tensor input, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups, bool benchmark, bool deterministic, bool cudnn_enabled, bool allow_tf32) -> (Tensor)\r\n- aten::_convolution.deprecated(Tensor input, Tensor weight, Tensor? 
bias, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups, bool benchmark, bool deterministic, bool cudnn_enabled) -> (Tensor)\r\n- aten::abs(Tensor self) -> (Tensor)\r\n- aten::acos(Tensor self) -> (Tensor)\r\n- aten::acosh(Tensor self) -> (Tensor)\r\n- aten::adaptive_avg_pool1d(Tensor self, int[1] output_size) -> (Tensor)\r\n- aten::adaptive_avg_pool2d(Tensor self, int[2] output_size) -> (Tensor)\r\n- aten::adaptive_avg_pool3d(Tensor self, int[3] output_size) -> (Tensor)\r\n- aten::adaptive_max_pool1d(Tensor self, int[2] output_size) -> (Tensor, Tensor)\r\n- aten::adaptive_max_pool2d(Tensor self, int[2] output_size) -> (Tensor, Tensor)\r\n- aten::adaptive_max_pool3d(Tensor self, int[3] output_size) -> (Tensor, Tensor)\r\n- aten::add.Scalar(Tensor self, Scalar other, Scalar alpha=1) -> (Tensor)\r\n- aten::add.Tensor(Tensor self, Tensor other, Scalar alpha=1) -> (Tensor)\r\n- aten::add_.Tensor(Tensor(a!) self, Tensor other, *, Scalar alpha=1) -> (Tensor(a!))\r\n- aten::asin(Tensor self) -> (Tensor)\r\n- aten::asinh(Tensor self) -> (Tensor)\r\n- aten::atan(Tensor self) -> (Tensor)\r\n- aten::atanh(Tensor self) -> (Tensor)\r\n- aten::avg_pool1d(Tensor self, int[1] kernel_size, int[1] stride=[], int[1] padding=[0], bool ceil_mode=False, bool count_include_pad=True) -> (Tensor)\r\n- aten::avg_pool2d(Tensor self, int[2] kernel_size, int[2] stride=[], int[2] padding=[0, 0], bool ceil_mode=False, bool count_include_pad=True, int? divisor_override=None) -> (Tensor)\r\n- aten::avg_pool3d(Tensor self, int[3] kernel_size, int[3] stride=[], int[3] padding=[], bool ceil_mode=False, bool count_include_pad=True, int? divisor_override=None) -> (Tensor)\r\n- aten::batch_norm(Tensor input, Tensor? gamma, Tensor? beta, Tensor? mean, Tensor? var, bool training, float momentum, float eps, bool cudnn_enabled) -> (Tensor)\r\n- aten::bmm(Tensor self, Tensor mat2) -> (Tensor)\r\n- aten::cat(Tensor[] tensors, int dim=0) -> (Tensor)\r\n- aten::ceil(Tensor self) -> (Tensor)\r\n- aten::clamp(Tensor self, Scalar? min=None, Scalar? max=None) -> (Tensor)\r\n- aten::clamp_max(Tensor self, Scalar max) -> (Tensor)\r\n- aten::clamp_min(Tensor self, Scalar min) -> (Tensor)\r\n- aten::constant_pad_nd(Tensor self, int[] pad, Scalar value=0) -> (Tensor)\r\n- aten::cos(Tensor self) -> (Tensor)\r\n- aten::cosh(Tensor self) -> (Tensor)\r\n- aten::cumsum(Tensor self, int dim, *, int? dtype=None) -> (Tensor)\r\n- aten::div.Scalar(Tensor self, Scalar other) -> (Tensor)\r\n- aten::div.Tensor(Tensor self, Tensor other) -> (Tensor)\r\n- aten::div.Tensor_mode(Tensor self, Tensor other, *, str? rounding_mode) -> (Tensor)\r\n- aten::div_.Scalar(Tensor(a!) self, Scalar other) -> (Tensor(a!))\r\n- aten::div_.Tensor(Tensor(a!) 
self, Tensor other) -> (Tensor(a!))\r\n- aten::elu(Tensor self, Scalar alpha=1, Scalar scale=1, Scalar input_scale=1) -> (Tensor)\r\n- aten::embedding(Tensor weight, Tensor indices, int padding_idx=-1, bool scale_grad_by_freq=False, bool sparse=False) -> (Tensor)\r\n- aten::eq.Scalar(Tensor self, Scalar other) -> (Tensor)\r\n- aten::eq.Tensor(Tensor self, Tensor other) -> (Tensor)\r\n- aten::erf(Tensor self) -> (Tensor)\r\n- aten::exp(Tensor self) -> (Tensor)\r\n- aten::expand(Tensor(a) self, int[] size, *, bool implicit=False) -> (Tensor(a))\r\n- aten::expand_as(Tensor(a) self, Tensor other) -> (Tensor(a))\r\n- aten::","2022-07-16T01:58:03",{"id":251,"version":252,"summary_zh":253,"released_at":254},351216,"v1.1.0","## Support for PyTorch 1.11, Various Bug Fixes, Partial `aten::Int` support, New Debugging Tools, Removing Max Batch Size\r\n\r\nTorch-TensorRT 1.1.0 targets PyTorch 1.11, CUDA 11.3, cuDNN 8.2 and TensorRT 8.2. Due to recent JetPack upgrades, this release does not support Jetson (Jetpack 5.0DP or otherwise). Jetpack 5.0DP support will arrive in a mid-cycle release (Torch-TensorRT 1.1.x) along with support for TensorRT 8.4. 1.1.0 also drops support for Python 3.6 as it has reached end of life. Following 1.0.0, this release is focused on stabilizing and improving the core of Torch-TensorRT. Many improvements have been made to the partitioning system addressing limitation many users hit while trying to partially compile PyTorch modules. Torch-TensorRT 1.1.0 also addresses a long standing issue with `aten::Int` operators (albeit) partially. Now certain common patterns which use `aten::Int` can be handled by the compiler without resorting to partial compilation. Most notably, this means that models like BERT can be run end to end with Torch-TensorRT, resulting in significant performance gains.  \r\n\r\n### New Debugging Tools\r\n\r\nWith this release we are introducing new syntax sugar that can be used to more easily debug Torch-TensorRT compilation and execution through the use of context managers. 
For example, in Torch-TensorRT 1.0.0 this was a common pattern to turn debug info on and then off again: \r\n```py\r\nimport torch_tensorrt\r\n...\r\ntorch_tensorrt.logging.set_reportable_log_level(torch_tensorrt.logging.Level.Debug)\r\ntrt_module = torch_tensorrt.compile(my_module, ...)\r\ntorch_tensorrt.logging.set_reportable_log_level(torch_tensorrt.logging.Level.Warning)\r\nresults = trt_module(input_tensors)\r\n```\r\n\r\nWith Torch-TensorRT 1.1.0, this can now be done with the following code:\r\n```py\r\nimport torch_tensorrt\r\n...\r\nwith torch_tensorrt.logging.debug():\r\n    trt_module = torch_tensorrt.compile(my_module,...)\r\nresults = trt_module(input_tensors)\r\n```\r\n\r\nYou can also use this API to debug the Torch-TensorRT runtime: \r\n```py\r\nimport torch_tensorrt\r\ntorch_tensorrt.logging.set_reportable_log_level(torch_tensorrt.logging.Level.Error)\r\n...\r\ntrt_module = torch_tensorrt.compile(my_module,...)\r\nwith torch_tensorrt.logging.warnings():\r\n    results = trt_module(input_tensors)\r\n```\r\n\r\nThe following levels are available:\r\n```py\r\n\r\n# Only internal TensorRT failures will be logged\r\nwith torch_tensorrt.logging.internal_errors():\r\n\r\n# Internal TensorRT failures + Torch-TensorRT errors will be logged\r\nwith torch_tensorrt.logging.errors():\r\n\r\n# All Errors plus warnings will be logged\r\nwith torch_tensorrt.logging.warnings():\r\n\r\n# First verbosity level, information about major steps occurring during compilation and execution\r\nwith torch_tensorrt.logging.info():\r\n\r\n# Second verbosity level, each step is logged + information about compiler state will be outputted\r\nwith torch_tensorrt.logging.debug():\r\n\r\n# Third verbosity level, all above information + intermediate transformations of the graph during lowering\r\nwith torch_tensorrt.logging.graphs():\r\n```\r\n\r\n### Removing Max Batch Size, Strict Types \r\n\r\nIn this release we are removing the `max_batch_size` and `strict_types` settings. These settings corresponded directly to TensorRT settings; however, they were not always respected, which often led to confusion. Therefore we thought it best to disable these features, as deterministic behavior could not be ensured. \r\n\r\n#### Porting forward from `max_batch_size`, `strict_types`:\r\n\r\n- `max_batch_size`: The first dim in shapes provided to Torch-TensorRT is considered the batch dimension, so instead of setting `max_batch_size` you can just use the Input objects directly (see the sketch below) \r\n- `strict_types`: A replacement with more deterministic behavior will come with an upcoming TensorRT release. 
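\r\n\r\nAs a minimal sketch of the port-forward from `max_batch_size` (the module and shapes below are placeholders), the allowed batch range is expressed directly on the Input spec:\r\n\r\n```py\r\nimport torch\r\nimport torch_tensorrt\r\n\r\n# Previously: max_batch_size=32 with a fixed (3, 224, 224) input\r\n# Now: encode the batch range in the first dimension of the Input spec\r\ntrt_module = torch_tensorrt.compile(\r\n    my_module,  # hypothetical torch.nn.Module\r\n    inputs=[\r\n        torch_tensorrt.Input(\r\n            min_shape=(1, 3, 224, 224),\r\n            opt_shape=(16, 3, 224, 224),\r\n            max_shape=(32, 3, 224, 224),\r\n            dtype=torch.float32,\r\n        )\r\n    ],\r\n    enabled_precisions={torch.float32, torch.float16},\r\n)\r\n```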
\r\n\r\n### Dependencies\r\n\r\n```\r\n- Bazel 5.1.1\r\n- LibTorch 1.11.0\r\n- CUDA 11.3 (on x86_64, by default, newer CUDA 11 supported with compatible PyTorch Build)\r\n- cuDNN 8.2.4.15\r\n- TensorRT 8.2.4.2\r\n```\r\n\r\n# 1.1.0 (2022-05-10)\r\n\r\n### Bug Fixes\r\n\r\n* add at::adaptive_avg_pool1d in interpolate plugin and fix [#791](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fissues\u002F791) ([deb9f74](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002Fdeb9f74))\r\n* Added ipywidget dependency to notebook ([0b2040a](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002F0b2040a))\r\n* Added test case names ([296e98a](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002F296e98a))\r\n* Added truncate_long_and_double ([417c096](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002F417c096))\r\n* Adding truncate_long_and_double to ptq tests ([3a0640a](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002F3a0640a))\r\n* Avoid resolving non-tensor inputs to torch segment_blocks unneccessarily ([3e090ee](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002F3e090ee))\r\n* Considering rtol and atol in threshold comparison for floating point numbers ([0b0ba8d](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002F0b0ba8d))\r\n* Disabled mobilenet_v2 test for DLFW CI ([40c611f](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002F40c611f))\r\n","2022-05-10T08:23:01",{"id":256,"version":257,"summary_zh":258,"released_at":259},351217,"v1.0.0","## New Name!, Support for PyTorch 1.10, CUDA 11.3, New Packaging and Distribution Options, Stabilized APIs, Stabilized Partial Compilation, Adjusted Default Behavior, Usability Improvements, New Converters, Bug Fixes \r\n\r\nThis is the first stable release of Torch-TensorRT targeting PyTorch 1.10, CUDA 11.3 (on x86_64, CUDA 10.2 on aarch64), cuDNN 8.2 and TensorRT 8.0 with backwards compatible source for TensorRT 7.1. On aarch64 TRTorch targets Jetpack 4.6 primarily with backwards compatible source for Jetpack 4.5. This version also removes deprecated APIs such as `InputRange` and `op_precision` \r\n\r\n### New Name \r\nTRTorch is now Torch-TensorRT! TRTorch started out as a small experimental project compiling TorchScript to TensorRT almost two years ago and now as we are hitting v1.0.0 with APIs and major features stabilizing we felt that the name of the project should reflect the ecosystem of tools it is joining with this release, namely TF-TRT (https:\u002F\u002Fblog.tensorflow.org\u002F2021\u002F01\u002Fleveraging-tensorflow-tensorrt-integration.html) and MXNet-TensorRT(https:\u002F\u002Fmxnet.apache.org\u002Fversions\u002F1.8.0\u002Fapi\u002Fpython\u002Fdocs\u002Ftutorials\u002Fperformance\u002Fbackend\u002Ftensorrt\u002Ftensorrt). Since we were already significantly changing APIs with this release to reflect what we learned over the last two years of using TRTorch, we felt this is was the right time to change the name as well. \r\n\r\nThe overall process to port forward from TRTorch is as follows:\r\n\r\n- #### Python\r\n  - The library has been renamed from `trtorch` to `torch_tensorrt`\r\n  - Components that used to all live under the `trtorch` namespace have now been separated. IR agnostic components: `torch_tensorrt.Input`, `torch_tensorrt.Device`, `torch_tensorrt.ptq`, `torch_tensorrt.logging` will continue to live under the top level namespace. 
IR specific components like `torch_tensorrt.ts.compile`, `torch_tensorrt.ts.convert_method_to_trt_engine`, `torch_tensorrt.ts.TensorRTCompileSpec` will live in a TorchScript specific namespace. This gives us space to explore the other IRs that might be relevant to the project in the future. In place of the old top-level `compile` and `convert_method_to_engine` are new ones which will call the IR specific versions based on what is provided to them. This also means that you can now provide a raw `torch.nn.Module` to `torch_tensorrt.compile` and Torch-TensorRT will handle the TorchScripting step for you. For the most part, the only change needed to move over namespaces is to exchange `trtorch` for `torch_tensorrt`.\r\n\r\n- #### C++ \r\n  - Similar to Python, the namespaces in C++ have changed from `trtorch` to `torch_tensorrt`, and components specific to the IR like `compile`, `convert_method_to_trt_engine` and `CompileSpec` are in a `torchscript` namespace, while agnostic components are at the top level. Namespace aliases for `torch_tensorrt` -> `torchtrt` and `torchscript` -> `ts` are included. Again, the port-forward process for namespaces should be a find-and-replace. Finally, the libraries `libtrtorch.so`, `libtrtorchrt.so` and `libtrtorch_plugins.so` have been renamed to `libtorchtrt.so`, `libtorchtrt_runtime.so` and `libtorchtrt_plugins.so` respectively. \r\n\r\n- #### CLI: \r\n   - `trtorch` has been renamed to `torchtrtc` \r\n\r\n### New Distribution Options and Packaging\r\n\r\nStarting with `nvcr.io\u002Fnvidia\u002Fpytorch:21.11`, Torch-TensorRT will be distributed as part of the container (https:\u002F\u002Fcatalog.ngc.nvidia.com\u002Forgs\u002Fnvidia\u002Fcontainers\u002Fpytorch). The version of Torch-TensorRT in the container will be the state of the master branch at the time of building. Torch-TensorRT will be validated to run correctly with the version of PyTorch, CUDA, cuDNN and TensorRT in the container. This will serve as the easiest way to have a fully validated PyTorch end-to-end training-to-inference stack and serves as a great starting point for building DL applications. \r\n\r\nAlso as part of Torch-TensorRT, we are now starting to distribute the full C++ package within the wheel files for the Python packages. By installing the wheel you now get the Python API, the C++ libraries + headers and the CLI binary. This is going to be the easiest way to install Torch-TensorRT on your stack. After installing with pip:\r\n\r\n```\r\npip3 install torch-tensorrt -f https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTorch-TensorRT\u002Freleases\r\n```\r\n\r\nYou can add the following to your `PATH` to set up the CLI:\r\n\r\n```\r\nPATH=$PATH:\u003CPATH TO TORCHTRT PYTHON PACKAGE>\u002Fbin\r\n```\r\n\r\n### Stabilized APIs \r\n\r\n#### Python\r\nMany of the APIs have changed slightly in this release to be more self-consistent and more usable. These changes begin with the Python API, for which `compile`, `convert_method_to_trt_engine` and `TensorRTCompileSpec` now use kwargs instead of dictionaries. As many features came out of beta and experimental status, the need for multiple levels of nesting in settings has decreased; therefore, kwargs make much more sense. 
You can simply port forward to the new APIs by unwrapping your existing `compile_spec` dict in the argumen","2021-11-09T08:26:34",{"id":261,"version":262,"summary_zh":263,"released_at":264},351218,"v0.4.1","# TRTorch v0.4.1\r\n\r\n## Bug Fixes for Module Ignorelist for Partial Compilation, `trtorch.Device`, Version updates for PyTorch, TensorRT, cuDNN\r\n\r\n### Target Platform Changes \r\n\r\nThis is the first patch of TRTorch v0.4. It now targets by default PyTorch 1.9.1, TensorRT 8.0.3.4 and cuDNN 8.2.4.15 and CUDA 11.1. Older versions of PyTorch, TensorRT, cuDNN are still supported in the same manner as TRTorch v0.4.0\r\n\r\n### Module Ignorelist for Partial Compilation \r\n\r\nThere was an issue with the pass marking modules to be ignored during compilation where it unsafely assumed that methods are named `forward` all the way down the module tree. While this was fine for 1.8.0, with PyTorch 1.9.0, the TorchScript codegen changed slightly to sometimes use methods of other names for modules which reduce trivially to a functional api. This fix now will identify method calls as the recursion point and then use those method calls to select modules to recurse on. It will also check to verify existence of these modules and methods before recursing. Finally this pass was run by default even if the ignore list was empty causing issues for users not using the feature. Therefore this pass is now disabled unless explicitly enabled\r\n\r\n### `trtorch.Device`\r\n\r\nSome of the constructors for `trtorch.Device` would not work or incorrectly configure the device. This patch will fix those issues. \r\n\r\n#### Dependencies\r\n```\r\n- Bazel 4.0.0\r\n- LibTorch 1.9.1\r\n- CUDA 11.1 (on x86_64, by default, newer CUDA 11 supported with compatible PyTorch Build), 10.2 (on aarch64)\r\n- cuDNN 8.2.3.4\r\n- TensorRT 8.0.3.4\r\n```\r\n\r\n## 0.4.1 (2021-10-06)\r\n### Bug Fixes\r\n- \u002F\u002Fcore\u002Flowering: Fixes module level fallback recursion (2fc612d)\r\n- Move some lowering passes to graph level logging (0266f41)\r\n- \u002F\u002Fpy: Fix trtorch.Device alternate contructor options (ac26841)\r\n\r\n\r\nOperators Supported\r\n=================================\r\n\r\n\r\nOperators Currently Supported Through Converters\r\n-------------------------------------------------\r\n\r\n- aten::_convolution(Tensor input, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups, bool benchmark, bool deterministic, bool cudnn_enabled, bool allow_tf32) -> (Tensor)\r\n- aten::_convolution.deprecated(Tensor input, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups, bool benchmark, bool deterministic, bool cudnn_enabled) -> (Tensor)\r\n- aten::abs(Tensor self) -> (Tensor)\r\n- aten::acos(Tensor self) -> (Tensor)\r\n- aten::acosh(Tensor self) -> (Tensor)\r\n- aten::adaptive_avg_pool1d(Tensor self, int[1] output_size) -> (Tensor)\r\n- aten::adaptive_avg_pool2d(Tensor self, int[2] output_size) -> (Tensor)\r\n- aten::adaptive_max_pool2d(Tensor self, int[2] output_size) -> (Tensor, Tensor)\r\n- aten::add.Scalar(Tensor self, Scalar other, Scalar alpha=1) -> (Tensor)\r\n- aten::add.Tensor(Tensor self, Tensor other, Scalar alpha=1) -> (Tensor)\r\n- aten::add_.Tensor(Tensor(a!) 
self, Tensor other, *, Scalar alpha=1) -> (Tensor(a!))\r\n- aten::asin(Tensor self) -> (Tensor)\r\n- aten::asinh(Tensor self) -> (Tensor)\r\n- aten::atan(Tensor self) -> (Tensor)\r\n- aten::atanh(Tensor self) -> (Tensor)\r\n- aten::avg_pool1d(Tensor self, int[1] kernel_size, int[1] stride=[], int[1] padding=[0], bool ceil_mode=False, bool count_include_pad=True) -> (Tensor)\r\n- aten::avg_pool2d(Tensor self, int[2] kernel_size, int[2] stride=[], int[2] padding=[0, 0], bool ceil_mode=False, bool count_include_pad=True, int? divisor_override=None) -> (Tensor)\r\n- aten::avg_pool3d(Tensor self, int[3] kernel_size, int[3] stride=[], int[3] padding=[], bool ceil_mode=False, bool count_include_pad=True, int? divisor_override=None) -> (Tensor)\r\n- aten::batch_norm(Tensor input, Tensor? gamma, Tensor? beta, Tensor? mean, Tensor? var, bool training, float momentum, float eps, bool cudnn_enabled) -> (Tensor)\r\n- aten::bmm(Tensor self, Tensor mat2) -> (Tensor)\r\n- aten::cat(Tensor[] tensors, int dim=0) -> (Tensor)\r\n- aten::ceil(Tensor self) -> (Tensor)\r\n- aten::clamp(Tensor self, Scalar? min=None, Scalar? max=None) -> (Tensor)\r\n- aten::clamp_max(Tensor self, Scalar max) -> (Tensor)\r\n- aten::clamp_min(Tensor self, Scalar min) -> (Tensor)\r\n- aten::constant_pad_nd(Tensor self, int[] pad, Scalar value=0) -> (Tensor)\r\n- aten::cos(Tensor self) -> (Tensor)\r\n- aten::cosh(Tensor self) -> (Tensor)\r\n- aten::cumsum(Tensor self, int dim, *, int? dtype=None) -> (Tensor)\r\n- aten::div.Scalar(Tensor self, Scalar other) -> (Tensor)\r\n- aten::div.Tensor(Tensor self, Tensor other) -> (Tensor)\r\n- aten::div_.Scalar(Tensor(a!) self, Scalar other) -> (Tensor(a!))\r\n- aten::div_.Tensor(Tensor(a!) self, Tensor other) -> (Tensor(a!))\r\n- aten::elu(Tensor self, Scalar alpha=1, Scalar scale=1, Scalar input_scale=1) -> (Tensor)\r\n- aten::embedding(Tensor weight, Tensor indices, int padding_idx=-1, bool scale_grad_by_freq=False, bool sparse=False) -> (Tensor)\r\n- aten::eq.Scalar(Tensor self, Scalar other) -> (Tensor","2021-10-06T19:14:45",{"id":266,"version":267,"summary_zh":268,"released_at":269},351219,"v0.4.0","# TRTorch v0.4.0\r\n\r\n## Support for PyTorch 1.9, TensorRT 8.0. Introducing INT8 Execution for QAT models, Module Based Partial Compilation, Auto Device Configuration, Input Class, Usability Improvements,  New Converters, Bug Fixes \r\n\r\n### Target Platform Changes\r\n\r\nThis is the fourth beta release of TRTorch, targeting PyTorch 1.9, CUDA 11.1 (on x86_64, CUDA 10.2 on aarch64),  cuDNN 8.2 and TensorRT 8.0 with backwards compatible source for TensorRT 7.1. On aarch64 TRTorch targets Jetpack 4.6 primarily with backwards compatibile source for Jetpack 4.5. When building on Jetson, the flag `--platforms \u002F\u002Ftoolchains:jetpack_4.x` must be now be provided for C++ compilation to select the correct dependency paths. For python by default it is assumed the Jetpack version is 4.6. To override this add the `--jetpack-version 4.5` flag when building. \r\n\r\n### TensorRT 8.0\r\n\r\nThis release adds support for compiling models trained with Quantization aware training (QAT) allowing users using the TensorRT PyTorch Quantization Toolkit (https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT\u002Ftree\u002Fmaster\u002Ftools\u002Fpytorch-quantization) to compile their models using TRTorch. For more information and a tutorial, refer to https:\u002F\u002Fwww.github.com\u002FNVIDIA\u002FTRTorch\u002Ftree\u002Fv0.4.0\u002Fexamples\u002Fint8\u002Fqat. 
It also adds support for sparsity via the `sparse_weights` flag in the compile spec. This allows TensorRT to utilize specialized hardware in Ampere GPUs to minimize unnecessary computation and therefore increase computational efficiency. \r\n\r\n### Partial Compilation\r\n\r\nIn v0.4.0 the partial compilation feature of TRTorch can now be considered beta level stability. New in this release is the ability to specify entire PyTorch modules to run in PyTorch explicitly as part of partial compilation. This should let users isolate troublesome code easily when compiling. Again, feedback on this feature is greatly appreciated. \r\n\r\n### Automatic Device Configuration at Runtime\r\n\r\n v0.4.0 also changes the \"ABI\" of TRTorch to now include information about the target device for the program. Programs compiled with v0.4.0 will look for and select the most compatible available device. The rules used are: Any valid device option must have the same SM capability as the device building the engine. From there, TRTorch prefers the same device (e.g. Built on A100 so A100 is better than A30) and finally prefers the same device ID.  Users will be warned if this selected device is not the current active device in the course of execution as overhead may be incurred in transferring input tensors from the current device to the target device. Users can then modify their code to avoid this. Due to this ABI change, existing compiled TRTorch programs are incompatible with the TRTorch v0.4.0 runtime. From v0.4.0 onwards an internal ABI version will check program compatibility. This ABI version is only incremented with breaking changes to the ABI. \r\n\r\n### API Changes (Input, enabled_precisions, Device)\r\n\r\nTRTorch v0.4.0 changes the API for specifying Input shapes and data types to provide users more control over configuration. The new API makes use of the class `trtorch.Input` which lets users set the shape (or shape range) as well as memory layout and expected data type. These input specs are set in the `input` field of the ` CompileSpec`. \r\n\r\n```python\r\n\"inputs\": [\r\n        trtorch.Input((1, 3, 224, 224)), # Static input shape for input #1\r\n        trtorch.Input(\r\n            min_shape=(1, 224, 224, 3),\r\n            opt_shape=(1, 512, 512, 3),\r\n            max_shape=(1, 1024, 1024, 3),\r\n            dtype=torch.int32,\r\n            format=torch.channel_last,\r\n        ) # Dynamic input shape for input #2, input type int and channel last format\r\n    ],\r\n```\r\n\r\n\r\n\r\nThe legacy `input_shapes` field and associated usage with lists of tuples\u002F`InputRanges` should now be considered deprecated. They remain usable in v0.4.0 but will be removed in the next release. Similarly, the compile spec field `op_precision` is now also deprecated in favor of `enabled_precisions`. `enabled_precisions` is a set containing the data types that kernels will be allowed to use. Whereas setting `op_precision = torch.int8` would implicitly enable FP32 and FP16 kernels as well, now `enabled_precisions` should be set as `{torch.float32, torch.float16, torch.int8}` to do the same. In order to maintain similar behavior to normal PyTorch, if FP16 is the lowest precision enabled but no explicit data type is set for the inputs to the model, the expectation will be that inputs will be in FP16 . For other cases (FP32, INT8) FP32 is the default, similar to PyTorch and previous versions of TRTorch. Finally in the Python API, a class `trtorch.Device` has been added. 
While users can continue to use `torch.Device` or other torch APIs, `trtorch.Device` allows for better control for the specific use cases of compiling with TRTorch (e.g. setting the DLA core and GPU fallback). This class is very similar to the C++ version with a couple of additions of syntactic sugar","2021-08-24T21:49:16",{"id":271,"version":272,"summary_zh":273,"released_at":274},351220,"v0.3.0","# TRTorch v0.3.0\r\n\r\n## Support for PyTorch 1.8.x (by default 1.8.1), Introducing Plugin Library, PTQ from Python, Arbitrary TRT engine embedding, Preview Release of Partial Compilation, New Converters, Bug Fixes\r\n\r\nThis is the third beta release of TRTorch, targeting PyTorch 1.8.x, CUDA 11.1 (on x86_64), TensorRT 7.2, cuDNN 8. TRTorch 0.3.0 binary releases target PyTorch 1.8.1 specifically; these builds are not compatible with 1.8.0, though the source code remains compatible with any PyTorch 1.8.x version. On aarch64 TRTorch targets JetPack 4.5.x. This release introduces `libtrtorch_plugins.so`. This library is a portable distribution of all TensorRT plugins used in TRTorch. The intended use case is to support TRTorch programs that utilize TensorRT plugins on systems where only the runtime library is available, or cases where TRTorch was used to create a TensorRT engine that makes use of TRTorch plugins and will be run outside the TRTorch runtime. An example of how to use this library can be found here: https:\u002F\u002Fwww.github.com\u002FNVIDIA\u002FTRTorch\u002Ftree\u002Fv0.3.0\u002Fexamples\u002Fsample_rt_app. TRTorch 0.3.0 also now allows users to repurpose PyTorch DataLoaders to do post-training quantization in Python, similar to the workflow currently supported in C++. It also introduces a new API to wrap arbitrary TensorRT engines in a PyTorch Module wrapper, making them serializable by `torch.jit.save` and completely compatible with other PyTorch modules. Finally, TRTorch 0.3.0 also includes a preview of the new partial compilation capability of the TRTorch compiler. With this feature, users can now instruct TRTorch to keep operations that are not supported by TRTorch\u002FTensorRT in PyTorch. Partial compilation should be considered alpha-level in stability, and we are seeking feedback on bugs, pain points and feature requests surrounding the use of this feature. 
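\r\n\r\nAs a rough sketch of what enabling this partial compilation preview might look like (the module and shapes are placeholders, and the `torch_fallback` sub-field names below are assumptions inferred from the fallback-related entries in the Features list, not confirmed by these notes):\r\n\r\n```py\r\nimport torch\r\nimport trtorch\r\n\r\nscripted_model = torch.jit.script(MyModel().eval())  # hypothetical torch.nn.Module\r\n\r\ncompile_spec = {\r\n    \"input_shapes\": [(1, 3, 224, 224)],\r\n    # Preview feature: keep unsupported (or explicitly listed) ops running in PyTorch\r\n    # NOTE: sub-field names are illustrative assumptions\r\n    \"torch_fallback\": {\r\n        \"enabled\": True,\r\n        \"min_block_size\": 1,\r\n        \"forced_fallback_ops\": [\"aten::flatten\"],\r\n    },\r\n}\r\n\r\ntrt_model = trtorch.compile(scripted_model, compile_spec)\r\n```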
\r\n\r\n### Dependencies:\r\n\r\n```\r\n- Bazel 4.0.0\r\n- LibTorch 1.8.1 (on x86_64), 1.8.0 (on aarch64)\r\n- CUDA 11.1 (on x86_64, by default , newer CUDA 11 supported with compatible PyTorch Build), 10.2 (on aarch64)\r\n- cuDNN 8.1.1\r\n- TensorRT 7.2.3.4\r\n```\r\n\r\n\r\n# 0.3.0 (2021-05-13)\r\n\r\n\r\n### Bug Fixes\r\n\r\n* **\u002F\u002Fplugins:** Readding cuBLAS BUILD to allow linking of libnvinfer_plugin on Jetson ([a8008f4](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002Fa8008f4))\r\n* **\u002F\u002Ftests\u002F..\u002Fconcat:** Concat test fix ([2432fb8](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002F2432fb8))\r\n* **\u002F\u002Ftests\u002Fcore\u002Fpartitioning:** Fixing some issues with the partition ([ff89059](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002Fff89059))\r\n* erase the repetitive nodes in dependency analysis ([80b1038](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002F80b1038))\r\n* fix a typo for debug ([c823ebd](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002Fc823ebd))\r\n* fix typo bug ([e491bb5](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002Fe491bb5))\r\n* **aten::linear:** Fixes new issues in 1.8 that cause script based ([c5057f8](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002Fc5057f8))\r\n* register the torch_fallback attribute in Python API ([8b7919f](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002F8b7919f))\r\n* support expand\u002Frepeat with IValue type input ([a4882c6](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002Fa4882c6))\r\n* support shape inference for add_, support non-tensor arguments for segmented graphs ([46950bb](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002F46950bb))\r\n\r\n\r\n* feat!: Updating versions of CUDA, cuDNN, TensorRT and PyTorch ([71c4dcb](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002F71c4dcb))\r\n* feat(WORKSPACE)!: Updating PyTorch version to 1.8.1 ([c9aa99a](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002Fc9aa99a))\r\n\r\n\r\n### Features\r\n\r\n* **\u002F\u002F.github:** Linter throws 1 when there needs to be style changes to ([a39dea7](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002Fa39dea7))\r\n* **\u002F\u002Fcore:** New API to register arbitrary TRT engines in TorchScript ([3ec836e](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002F3ec836e))\r\n* **\u002F\u002Fcore\u002Fconversion\u002Fconversionctx:** Adding logging for truncated ([96245ee](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002F96245ee))\r\n* **\u002F\u002Fcore\u002Fpartitioing:** Adding ostream for Partition Info ([b3589c5](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002Fb3589c5))\r\n* **\u002F\u002Fcore\u002Fpartitioning:** Add an ostream implementation for ([ee536b6](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002Fee536b6))\r\n* **\u002F\u002Fcore\u002Fpartitioning:** Refactor top level partitioning API, fix a bug with ([abc63f6](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002Fabc63f6))\r\n* **\u002F\u002Fcore\u002Fplugins:** Gating plugin logging based on global config ([1d5a088](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002F1d5a088))\r\n* added user level API for fallback 
([f4c29b4](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002Ff4c29b4))\r\n* allow users to set fallback block size and ops ([6d3064a](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002F6d3064a))\r\n* insert nodes by dependencies for nonTensor inputs\u002Foutputs ([4e32eff](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002F4e32eff))\r\n* support aten::arange converter ([014e381](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTRTorch\u002Fcommit\u002F014e381))\r\n* support aten::transpose with negative dim ([4a1d2f3](https:\u002F\u002F","2021-05-14T00:55:36"]