[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-pytorch--ao":3,"tool-pytorch--ao":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",160015,2,"2026-04-18T11:30:52",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",109154,"2026-04-18T11:18:24",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 
的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":72,"owner_avatar_url":73,"owner_bio":74,"owner_company":75,"owner_location":75,"owner_email":75,"owner_twitter":75,"owner_website":76,"owner_url":77,"languages":78,"stars":115,"forks":116,"last_commit_at":117,"license":118,"difficulty_score":10,"env_os":119,"env_gpu":120,"env_ram":121,"env_deps":122,"category_tags":130,"github_topics":131,"view_count":32,"oss_zip_url":75,"oss_zip_packed_at":75,"status":17,"created_at":143,"updated_at":144,"faqs":145,"releases":174},9028,"pytorch\u002Fao","ao","PyTorch native quantization and sparsity for training and inference","TorchAO 是 PyTorch 官方推出的原生量化与稀疏化工具，旨在打通从模型训练到服务部署的全链路优化。它主要解决了大语言模型在训练时显存占用高、速度慢，以及在推理阶段部署成本高昂的难题。通过引入先进的压缩技术，TorchAO 能显著降低资源门槛：例如在预训练 Llama-3.1-70B 时可实现 1.5 倍加速，或将 Llama-3-8B 量化为 int4 格式，使推理速度提升近 1.9 倍的同时减少 58% 的内存占用。\n\n该工具特别适合 AI 研究人员、深度学习工程师以及需要高效部署大模型的开发者使用。其核心技术亮点在于“原生”集成，无需复杂的外部依赖即可在 PyTorch 生态中流畅运行。TorchAO 支持包括 float8 训练、感知量化训练（QAT）以及多种低比特（int4\u002Fint8）推理方案，并已与 Hugging Face Transformers、vLLM、Unsloth 等主流框架深度整合。无论是希望加速大规模预训练的研究团队，还是追求极致推理性能的工程团队，TorchAO 都能提供灵活且高效的解决方案，帮助用户在保持模型精度的前提下，大幅节省计算资源。","\u003Cdiv align=\"center\">\n\n# TorchAO\n\n\u003C\u002Fdiv>\n\n### PyTorch-Native Training-to-Serving Model Optimization\n- Pre-train Llama-3.1-70B **1.5x faster** with float8 training\n- Recover **67% of quantized accuracy degradation** on Gemma3-4B with QAT\n- Quantize Llama-3-8B to int4 for **1.89x faster** inference with **58% less memory**\n\n\u003Cdiv align=\"center\">\n\n[![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCodeML_%40_ICML-2025-blue)](https:\u002F\u002Fopenreview.net\u002Fattachment?id=HpqH0JakHf&name=pdf)\n[![](https:\u002F\u002Fdcbadge.vercel.app\u002Fapi\u002Fserver\u002Fgpumode?style=flat&label=TorchAO%20in%20GPU%20Mode)](https:\u002F\u002Fdiscord.com\u002Fchannels\u002F1189498204333543425\u002F1205223658021458100)\n[![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors-anon\u002Fpytorch\u002Fao?color=yellow&style=flat-square)](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fgraphs\u002Fcontributors)\n[![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Ftorchao-documentation-blue?color=DE3412)](https:\u002F\u002Fdocs.pytorch.org\u002Fao\u002Fstable\u002Findex.html)\n[![license](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-BSD_3--Clause-lightgrey.svg)](.\u002FLICENSE)\n\n[Latest News](#-latest-news) | [Overview](#-overview) | [Quick 
Start](#-quick-start)  | [Installation](#-installation) | [Integrations](#-integrations) | [Inference](#-inference) | [Training](#-training) | [Videos](#-videos) | [Citation](#-citation)\n\n\u003C\u002Fdiv>\n\n\n## 📣 Latest News\n\n- [Oct 25] QAT is now integrated into [Unsloth](https:\u002F\u002Fdocs.unsloth.ai\u002Fnew\u002Fquantization-aware-training-qat) for both full and LoRA fine-tuning! Try it out using [this notebook](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Funslothai\u002Fnotebooks\u002Fblob\u002Fmain\u002Fnb\u002FQwen3_%284B%29_Instruct-QAT.ipynb).\n- [Oct 25] MXFP8 MoE training prototype achieved **~1.45x speedup** for MoE layer in Llama4 Scout, and **~1.25x** speedup for MoE layer in DeepSeekV3 671b - with comparable numerics to bfloat16! Check out the [docs](.\u002Ftorchao\u002Fprototype\u002Fmoe_training\u002F) to try it out.\n- [Sept 25] MXFP8 training achieved [1.28x speedup on Crusoe B200 cluster](https:\u002F\u002Fpytorch.org\u002Fblog\u002Faccelerating-2k-scale-pre-training-up-to-1-28x-with-torchao-mxfp8-and-torchtitan-on-crusoe-b200-cluster\u002F) with virtually identical loss curve to bfloat16!\n- [Sept 19] [TorchAO Quantized Model and Quantization Recipes Now Available on Huggingface Hub](https:\u002F\u002Fpytorch.org\u002Fblog\u002Ftorchao-quantized-models-and-quantization-recipes-now-available-on-huggingface-hub\u002F)!\n- [Jun 25] Our [TorchAO paper](https:\u002F\u002Fopenreview.net\u002Fattachment?id=HpqH0JakHf&name=pdf) was accepted to CodeML @ ICML 2025!\n\n\n\u003Cdetails>\n  \u003Csummary>Older news\u003C\u002Fsummary>\n\n- [May 25] QAT is now integrated into [Axolotl](https:\u002F\u002Fgithub.com\u002Faxolotl-ai-cloud\u002Faxolotl) for fine-tuning ([docs](https:\u002F\u002Fdocs.axolotl.ai\u002Fdocs\u002Fqat.html))!\n- [Apr 25] Float8 rowwise training yielded [1.34-1.43x training speedup](https:\u002F\u002Fpytorch.org\u002Fblog\u002Faccelerating-large-scale-training-and-convergence-with-pytorch-float8-rowwise-on-crusoe-2k-h200s\u002F) at 2k H100 GPU scale\n- [Apr 25] TorchAO is added as a [quantization backend to vLLM](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Ffeatures\u002Fquantization\u002Ftorchao.html) ([docs](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Ffeatures\u002Fquantization\u002Ftorchao.html))!\n- [Mar 25] Our [2:4 Sparsity paper](https:\u002F\u002Fopenreview.net\u002Fpdf?id=O5feVk7p6Y) was accepted to SLLM @ ICLR 2025!\n- [Jan 25] Our [integration with GemLite and SGLang](https:\u002F\u002Fpytorch.org\u002Fblog\u002Faccelerating-llm-inference\u002F) yielded 1.1-2x faster inference with int4 and float8 quantization across different batch sizes and tensor parallel sizes\n- [Jan 25] We added [1-8 bit ARM CPU kernels](https:\u002F\u002Fpytorch.org\u002Fblog\u002Fhi-po-low-bit-operators\u002F) for linear and embedding ops\n- [Nov 24] We achieved [1.43-1.51x faster pre-training](https:\u002F\u002Fpytorch.org\u002Fblog\u002Ftraining-using-float8-fsdp2\u002F) on Llama-3.1-70B and 405B using float8 training\n- [Oct 24] TorchAO is added as a quantization backend to HF Transformers!\n- [Sep 24] We officially launched TorchAO. 
Check out our blog [here](https:\u002F\u002Fpytorch.org\u002Fblog\u002Fpytorch-native-architecture-optimization\u002F)!\n- [Jul 24] QAT [recovered up to 96% accuracy degradation](https:\u002F\u002Fpytorch.org\u002Fblog\u002Fquantization-aware-training\u002F) from quantization on Llama-3-8B\n- [Jun 24] Semi-structured 2:4 sparsity [achieved 1.1x inference speedup and 1.3x training speedup](https:\u002F\u002Fpytorch.org\u002Fblog\u002Faccelerating-neural-network-training\u002F) on the SAM and ViT models respectively\n- [Jun 24] Block sparsity [achieved 1.46x training speedup](https:\u002F\u002Fpytorch.org\u002Fblog\u002Fspeeding-up-vits\u002F) on the ViT model with \u003C2% drop in accuracy\n\n\u003C\u002Fdetails>\n\n\n## 🌅 Overview\n\nTorchAO is an easy-to-use quantization library for native PyTorch. TorchAO works out-of-the-box with `torch.compile()` and `FSDP2` across most HuggingFace PyTorch models.\n\nFor a detailed overview of stable and prototype workflows for different hardware and dtypes, see the [Workflows documentation](https:\u002F\u002Fdocs.pytorch.org\u002Fao\u002Fmain\u002Fworkflows.html).\n\nCheck out our [docs](https:\u002F\u002Fdocs.pytorch.org\u002Fao\u002Fmain\u002F) for more details!\n\n## 🚀 Quick Start\n\nFirst, install TorchAO. We recommend installing the latest stable version:\n```bash\npip install torchao\n```\n\nQuantize your model weights to int4!\n```python\nimport torch\nfrom torchao.quantization import Int4WeightOnlyConfig, quantize_\n\n# `model` is any already-instantiated torch.nn.Module on the target device\nif torch.cuda.is_available():\n  # quantize on CUDA\n  quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format=\"tile_packed_to_4d\", int4_choose_qparams_algorithm=\"hqq\"))\nelif torch.xpu.is_available():\n  # quantize on XPU\n  quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format=\"plain_int32\"))\n```\nSee our [quick start guide](https:\u002F\u002Fdocs.pytorch.org\u002Fao\u002Fstable\u002Fquick_start.html) for more details.\n\n## 🛠 Installation\n\nTo install the latest stable version:\n```bash\npip install torchao\n```\n\n\u003Cdetails>\n  \u003Csummary>Other installation options\u003C\u002Fsummary>\n\n  ```\n  # Nightly\n  pip install --pre torchao --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fnightly\u002Fcu128\n\n  # Different CUDA versions\n  pip install torchao --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu126  # CUDA 12.6\n  pip install torchao --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu129  # CUDA 12.9\n  pip install torchao --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fxpu    # XPU\n  pip install torchao --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcpu    # CPU only\n\n  # For developers\n  # Note: the `--no-build-isolation` flag is required.\n  USE_CUDA=1 pip install -e . --no-build-isolation\n  USE_XPU=1 pip install -e . --no-build-isolation\n  USE_CPP=0 pip install -e . --no-build-isolation\n  ```\n\n\u003C\u002Fdetails>\n\nPlease see the [torchao compatibility table](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fissues\u002F2919) for the version requirements of dependencies.\n\n### Optional Dependencies\n\n[MSLK](https:\u002F\u002Fgithub.com\u002Fpytorch\u002FMSLK) is an optional runtime dependency that provides accelerated kernels for some of the workflows in torchao. 
Stable MSLK should be used with stable torchao, and nightly MSLK with nightly torchao.\n```bash\n# Stable\npip install mslk-cuda==1.0.0\n\n# Nightly\npip install --pre mslk --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fnightly\u002Fcu128\n```\n\n## 🔎 Inference\n\nTorchAO delivers substantial performance gains with minimal code changes:\n\n- **Int4 weight-only**: [1.73x speedup with 65% less memory](https:\u002F\u002Fhuggingface.co\u002Fpytorch\u002Fgemma-3-12b-it-INT4) for Gemma3-12b-it on H100 with slight impact on accuracy\n- **Float8 dynamic quantization**: [1.5-1.6x speedup on gemma-3-27b-it](https:\u002F\u002Fhuggingface.co\u002Fpytorch\u002Fgemma-3-27b-it-FP8\u002Fblob\u002Fmain\u002FREADME.md#results-h100-machine) and [1.54x and 1.27x speedup on Flux.1-Dev* and CogVideoX-5b respectively](https:\u002F\u002Fgithub.com\u002Fsayakpaul\u002Fdiffusers-torchao) on H100 with preserved quality\n- **Int8 activation quantization and int4 weight quantization**: Quantized Qwen3-4B running at 14.8 tokens\u002Fs with 3379 MB memory usage on iPhone 15 Pro through [ExecuTorch](https:\u002F\u002Fhuggingface.co\u002Fpytorch\u002FQwen3-4B-INT8-INT4#running-in-a-mobile-app)\n- **Int4 + 2:4 Sparsity**: [2.37x throughput with 67.7% memory reduction](torchao\u002Fsparsity\u002FREADME.md) on Llama-3-8B\n\nThe following is our recommended flow for quantization and deployment:\n```python\nfrom transformers import TorchAoConfig, AutoModelForCausalLM\nfrom torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow\n\n# Create quantization configuration\nquantization_config = TorchAoConfig(quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))\n\n# Load and automatically quantize\nquantized_model = AutoModelForCausalLM.from_pretrained(\n    \"Qwen\u002FQwen3-32B\",\n    dtype=\"auto\",\n    device_map=\"auto\",\n    quantization_config=quantization_config\n)\n```\n\nIf the above doesn't work for your model, an alternative is the `quantize_` API described in the [quick start guide](https:\u002F\u002Fdocs.pytorch.org\u002Fao\u002Fmain\u002Fquick_start.html).\n\nServing with vLLM on a 1xH100 machine:\n```shell\n# Server\nVLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch\u002FQwen3-32B-FP8 --tokenizer Qwen\u002FQwen3-32B -O3\n```\n\n```shell\n# Client\ncurl http:\u002F\u002Flocalhost:8000\u002Fv1\u002Fchat\u002Fcompletions -H \"Content-Type: application\u002Fjson\" -d '{\n  \"model\": \"pytorch\u002FQwen3-32B-FP8\",\n  \"messages\": [\n    {\"role\": \"user\", \"content\": \"Give me a short introduction to large language models.\"}\n  ],\n  \"temperature\": 0.6,\n  \"top_p\": 0.95,\n  \"top_k\": 20,\n  \"max_tokens\": 32768\n}'\n```\n\nFor diffusion models, you can quantize using Hugging Face diffusers:\n\n```python\nimport torch\nfrom diffusers import DiffusionPipeline, PipelineQuantizationConfig, TorchAoConfig\nfrom torchao.quantization import Int8WeightOnlyConfig\nfrom torchao.quantization.granularity import PerGroup\n\npipeline_quant_config = PipelineQuantizationConfig(\n    quant_mapping={\"transformer\": TorchAoConfig(Int8WeightOnlyConfig(granularity=PerGroup(128)))}\n)\npipeline = DiffusionPipeline.from_pretrained(\n    \"black-forest-labs\u002FFLUX.1-dev\",\n    quantization_config=pipeline_quant_config,\n    torch_dtype=torch.bfloat16,\n    device_map=\"cuda\"\n)\n```\n\nWe also support deployment to edge devices through ExecuTorch; for more details, see the [quantization and serving guide](https:\u002F\u002Fdocs.pytorch.org\u002Fao\u002Fmain\u002Fserving.html). 
We also release pre-quantized models [here](https:\u002F\u002Fhuggingface.co\u002Fpytorch).\n\n## 🚅 Training\n\n### Quantization-Aware Training\n\nPost-training quantization (PTQ) can result in a fast and compact model, but may also lead to accuracy degradation. We recommend exploring Quantization-Aware Training (QAT) to overcome this limitation, especially for lower bit-width dtypes such as int4. In collaboration with [TorchTune](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtune\u002Fblob\u002Fmain\u002Frecipes\u002Fquantization.md#quantization-aware-training-qat), we've developed a QAT recipe that demonstrates significant accuracy improvements over traditional PTQ, recovering **96% of the accuracy degradation on hellaswag and 68% of the perplexity degradation on wikitext** for Llama3. For more details, please refer to the [QAT README](torchao\u002Fquantization\u002Fqat\u002FREADME.md) and the [original blog](https:\u002F\u002Fpytorch.org\u002Fblog\u002Fquantization-aware-training\u002F):\n\n```python\nimport torch\nfrom torchao.quantization import quantize_, Int8DynamicActivationIntxWeightConfig, PerGroup\nfrom torchao.quantization.qat import QATConfig\n\n# prepare\nbase_config = Int8DynamicActivationIntxWeightConfig(\n    weight_dtype=torch.int4,\n    weight_granularity=PerGroup(32),\n)\nquantize_(my_model, QATConfig(base_config, step=\"prepare\"))\n\n# train model (not shown)\n\n# convert\nquantize_(my_model, QATConfig(base_config, step=\"convert\"))\n```\n\nUsers can also combine LoRA + QAT to speed up training by [1.89x](https:\u002F\u002Fdev-discuss.pytorch.org\u002Ft\u002Fspeeding-up-qat-by-1-89x-with-lora\u002F2700) compared to vanilla QAT using this [fine-tuning recipe](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtune\u002Fblob\u002Fmain\u002Frecipes\u002Fqat_lora_finetune_distributed.py).\n\n\n### Quantized training\n\n[torchao.float8](torchao\u002Ffloat8) implements training recipes with the scaled float8 dtypes, as laid out in https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.05433. With `torch.compile` on, current results show throughput speedups of up to **1.5x at up to 512-GPU \u002F 405B-parameter scale** ([details](https:\u002F\u002Fpytorch.org\u002Fblog\u002Ftraining-using-float8-fsdp2\u002F)):\n\n```python\nfrom torchao.float8 import convert_to_float8_training\nconvert_to_float8_training(m)\n```\n\nOur float8 training is integrated into [TorchTitan's pre-training flows](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fblob\u002Fmain\u002Fdocs\u002Ffloat8.md) so users can easily try it out. 
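\n\nA minimal end-to-end sketch of the conversion above (the toy `nn.Sequential` model and the layer-skipping filter are illustrative; `module_filter_fn` is the hook documented in the torchao.float8 README for excluding modules from conversion):\n\n```python\nimport torch\nimport torch.nn as nn\nfrom torchao.float8 import convert_to_float8_training\n\n# toy stand-in for a real model; float8 kernels generally want dims divisible by 16\nm = nn.Sequential(nn.Linear(2048, 4096), nn.Linear(4096, 128)).bfloat16().cuda()\n\ndef module_filter_fn(mod: torch.nn.Module, fqn: str) -> bool:\n    # skip the final projection; convert every other nn.Linear to its float8 variant\n    return fqn != \"1\"\n\nconvert_to_float8_training(m, module_filter_fn=module_filter_fn)\nm = torch.compile(m)  # torch.compile is needed to realize the float8 speedups\n```\n\n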
For more details, check out these blog posts about our float8 training support:\n* [Accelerating Large Scale Training and Convergence with PyTorch Float8 Rowwise on Crusoe 2K H200s](https:\u002F\u002Fpytorch.org\u002Fblog\u002Faccelerating-large-scale-training-and-convergence-with-pytorch-float8-rowwise-on-crusoe-2k-h200s\u002F)\n* [Supercharging Training using float8 and FSDP2](https:\u002F\u002Fpytorch.org\u002Fblog\u002Ftraining-using-float8-fsdp2\u002F)\n* [Efficient Pre-training of Llama 3-like model architectures using torchtitan on Amazon SageMaker](https:\u002F\u002Faws.amazon.com\u002Fblogs\u002Fmachine-learning\u002Fefficient-pre-training-of-llama-3-like-model-architectures-using-torchtitan-on-amazon-sagemaker\u002F)\n* [Float8 in PyTorch](https:\u002F\u002Fdev-discuss.pytorch.org\u002Ft\u002Ffloat8-in-pytorch-1-x\u002F1815)\n\n\u003Cdetails>\n  \u003Csummary>Other features (sparse training, memory efficient optimizers)\u003C\u002Fsummary>\n\n### Sparse Training\n\nWe've added support for semi-structured 2:4 sparsity with **6% end-to-end speedups on ViT-L**. Full blog [here](https:\u002F\u002Fpytorch.org\u002Fblog\u002Faccelerating-neural-network-training\u002F). The code change is a one-liner, with the full example available [here](torchao\u002Fsparsity\u002Ftraining\u002F):\n\n```python\nfrom torchao.sparsity.training import SemiSparseLinear, swap_linear_with_semi_sparse_linear\nswap_linear_with_semi_sparse_linear(model, {\"seq.0\": SemiSparseLinear})\n```\n\n### Memory-efficient optimizers\n\nOptimizers like Adam can consume substantial GPU memory: up to 2x as much as the model parameters themselves. TorchAO provides two approaches to reduce this overhead:\n\n**1. Quantized optimizers**: Reduce optimizer state memory by 2-4x by quantizing to lower precision\n\n```python\nfrom torchao.optim import AdamW8bit, AdamW4bit, AdamWFp8\noptim = AdamW8bit(model.parameters()) # replace with AdamW4bit or AdamWFp8 for the 4-bit \u002F fp8 versions\n```\nOur quantized optimizers are implemented in just a few hundred lines of PyTorch code and compiled for efficiency. While slightly slower than specialized kernels, they offer an excellent balance of memory savings and performance. See detailed [benchmarks here](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Ftree\u002Fmain\u002Ftorchao\u002Foptim).\n\n**2. CPU offloading**: Move optimizer state and gradients to CPU memory\n\nFor maximum memory savings, we support [single GPU CPU offloading](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Ftree\u002Fmain\u002Ftorchao\u002Foptim#optimizer-cpu-offload) that efficiently moves both gradients and optimizer state to CPU memory. This approach can **reduce your VRAM requirements by 60%** with minimal impact on training speed:\n\n```python\nfrom torchao.optim import CPUOffloadOptimizer\n\noptim = CPUOffloadOptimizer(model.parameters(), torch.optim.AdamW, fused=True)\n# optionally restore optimizer state from an existing checkpoint dict\noptim.load_state_dict(ckpt[\"optim\"])\n```\n\n\u003C\u002Fdetails>\n\n\u003C!--\n## For Developers\n\n### Composability\n`torch.compile`: A key design principle for us is composability; any custom dtype or memory layout should work with our compiler. We enable kernel implementations in PyTorch, CUDA, C++, or Triton. 
This allows researchers and engineers to start with high-level dtype and layout logic in pure PyTorch, then progressively optimize performance by implementing lower-level kernels as needed, while maintaining compatibility with the compile infrastructure.\n\n[FSDP2](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fblob\u002Fmain\u002Fdocs\u002Ffsdp.md): Historically most quantization has been done for inference, there is now a thriving area of research combining distributed algorithms and quantization.\n\nThe best example we have combining the composability of lower bit dtype with compile and fsdp is [NF4](torchao\u002Fquantization\u002Fquantize_\u002Fworkflows\u002Fnf4\u002Fnf4_tensor.py) which we used to implement the [QLoRA](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=UvRl4ansfCg) algorithm. So if you're doing research at the intersection of this area we'd love to hear from you.\n\nOur framework makes it straightforward to add tensor parallel support to your custom quantized tensor subclass. Check out our [tensor parallel tutorial](tutorials\u002Fdeveloper_api_guide\u002Ftensor_parallel.py) to see how a quantized tensor subclass can be extended to support column and row-wise tensor sharding while maintaining compatibility with `torch.compile`.\n\n-->\n\n## 🔗 Integrations\n\nTorchAO is integrated into some of the leading open-source libraries including:\n\n* Unsloth now supports QAT: [Read blog](https:\u002F\u002Fdocs.unsloth.ai\u002Fnew\u002Fquantization-aware-training-qat) and [guide](https:\u002F\u002Fdocs.unsloth.ai\u002Fnew\u002Fquantization-aware-training-qat#qat--lora-finetuning).\n* HuggingFace transformers with a [builtin inference backend](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain\u002Fquantization\u002Ftorchao) and [low bit optimizers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Fpull\u002F31865)\n* HuggingFace [diffusers](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdiffusers\u002Fmain\u002Fen\u002Fquantization\u002Ftorchao) best practices with `torch.compile` and TorchAO in a standalone repo [diffusers-torchao](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers\u002Fblob\u002Fmain\u002Fdocs\u002Fsource\u002Fen\u002Fquantization\u002Ftorchao.md)\n* vLLM for LLM serving: [usage](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Ffeatures\u002Fquantization\u002Ftorchao.html), [detailed docs](https:\u002F\u002Fdocs.pytorch.org\u002Fao\u002Fmain\u002Ftorchao_vllm_integration.html)\n* Integration with [MSLK](https:\u002F\u002Fgithub.com\u002Fmeta-pytorch\u002FMSLK) for SOTA kernels on server GPUs\n* Integration with [ExecuTorch](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fexecutorch\u002F) for edge device deployment\n* Axolotl for [QAT](https:\u002F\u002Fdocs.axolotl.ai\u002Fdocs\u002Fqat.html) and [PTQ](https:\u002F\u002Fdocs.axolotl.ai\u002Fdocs\u002Fquantize.html)\n* TorchTitan for [float8 pre-training](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fblob\u002Fmain\u002Fdocs\u002Ffloat8.md)\n* HuggingFace PEFT for LoRA using TorchAO as their [quantization backend](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fpeft\u002Fen\u002Fdeveloper_guides\u002Fquantization#torchao-pytorch-architecture-optimization)\n* TorchTune for our NF4 [QLoRA](https:\u002F\u002Fdocs.pytorch.org\u002Ftorchtune\u002Fmain\u002Ftutorials\u002Fqlora_finetune.html), [QAT](https:\u002F\u002Fdocs.pytorch.org\u002Ftorchtune\u002Fmain\u002Frecipes\u002Fqat_distributed.html), and [float8 quantized 
fine-tuning](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtune\u002Fpull\u002F2546) recipes\n* SGLang for LLM serving: [usage](https:\u002F\u002Fdocs.sglang.ai\u002Fadvanced_features\u002Fquantization.html#online-quantization)\n\n## 🎥 Videos\n* [Keynote talk at GPU MODE IRL](https:\u002F\u002Fyoutu.be\u002FFH5wiwOyPX4?si=VZK22hHz25GRzBG1&t=1009)\n* [Low precision dtypes at PyTorch conference](https:\u002F\u002Fyoutu.be\u002FxcKwEZ77Cps?si=7BS6cXMGgYtFlnrA)\n* [Slaying OOMs at the Mastering LLM's course](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=UvRl4ansfCg)\n* [Advanced Quantization at CUDA MODE](https:\u002F\u002Fyoutu.be\u002F1u9xUK3G4VM?si=4JcPlw2w8chPXW8J)\n* [Chip Huyen's GPU Optimization Workshop](https:\u002F\u002Fwww.youtube.com\u002Flive\u002Fv_q2JTIqE20?si=mf7HeZ63rS-uYpS6)\n* [Cohere for AI community talk](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=lVgrE36ZUw0)\n\n\n## 💬 Citation\n\nIf you find the torchao library useful, please cite it in your work as below.\n\n```bibtex\n@misc{or2025torchao,\n  title={TorchAO: PyTorch-Native Training-to-Serving Model Optimization},\n  author={Andrew Or and Apurva Jain and Daniel Vega-Myhre and Jesse Cai and Charles David Hernandez and Zhenrui Zheng and Driss Guessous and Vasiliy Kuznetsov and Christian Puhrsch and Mark Saroufim and Supriya Rao and Thien Tran and Aleksandar Samardžić},\n  year={2025},\n  eprint={2507.16099},\n  archivePrefix={arXiv},\n  primaryClass={cs.LG},\n  url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.16099},\n}\n```\n","\u003Cdiv align=\"center\">\n\n# TorchAO\n\n\u003C\u002Fdiv>\n\n### PyTorch原生训练至推理模型优化\n- 使用float8训练，预训练Llama-3.1-70B速度提升**1.5倍**\n- 通过QAT在Gemma3-4B上恢复**67%的量化精度损失**\n- 将Llama-3-8B量化为int4，推理速度提升**1.89倍**，内存占用减少**58%**\n\n\u003Cdiv align=\"center\">\n\n[![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCodeML_%40_ICML-2025-blue)](https:\u002F\u002Fopenreview.net\u002Fattachment?id=HpqH0JakHf&name=pdf)\n[![](https:\u002F\u002Fdcbadge.vercel.app\u002Fapi\u002Fserver\u002Fgpumode?style=flat&label=TorchAO%20in%20GPU%20Mode)](https:\u002F\u002Fdiscord.com\u002Fchannels\u002F1189498204333543425\u002F1205223658021458100)\n[![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors-anon\u002Fpytorch\u002Fao?color=yellow&style=flat-square)](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fgraphs\u002Fcontributors)\n[![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Ftorchao-documentation-blue?color=DE3412)](https:\u002F\u002Fdocs.pytorch.org\u002Fao\u002Fstable\u002Findex.html)\n[![license](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-BSD_3--Clause-lightgrey.svg)](.\u002FLICENSE)\n\n[最新消息](#-latest-news) | [概述](#-overview) | [快速入门](#-quick-start)  | [安装](#-installation) | [集成](#-integrations) | [推理](#-inference) | [训练](#-training) | [视频](#-videos) | [引用](#-citation)\n\n\u003C\u002Fdiv>\n\n\n## 📣 最新消息\n\n- [2025年10月] QAT现已集成到[Unsloth](https:\u002F\u002Fdocs.unsloth.ai\u002Fnew\u002Fquantization-aware-training-qat)中，适用于全模型和LoRA微调！使用[此笔记本](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Funslothai\u002Fnotebooks\u002Fblob\u002Fmain\u002Fnb\u002FQwen3_%284B%29_Instruct-QAT.ipynb)即可体验。\n- [2025年10月] MXFP8 MoE训练原型在Llama4 Scout的MoE层实现了**约1.45倍加速**，在DeepSeekV3 671b的MoE层则实现了**约1.25倍加速**——且数值与bfloat16相当！请查看[文档](.\u002Ftorchao\u002Fprototype\u002Fmoe_training\u002F)以尝试。\n- [2025年9月] MXFP8训练在Crusoe 
B200集群上[实现了**1.28倍加速**](https:\u002F\u002Fpytorch.org\u002Fblog\u002Faccelerating-2k-scale-pre-training-up-to-1-28x-with-torchao-mxfp8-and-torchtitan-on-crusoe-b200-cluster\u002F)，且损失曲线与bfloat16几乎完全一致！\n- [9月19日] [TorchAO量化模型及量化配方现已上线Huggingface Hub](https:\u002F\u002Fpytorch.org\u002Fblog\u002Ftorchao-quantized-models-and-quantization-recipes-now-available-on-huggingface-hub\u002F)！\n- [2025年6月] 我们的[TorchAO论文](https:\u002F\u002Fopenreview.net\u002Fattachment?id=HpqH0JakHf&name=pdf)已被ICML 2025的CodeML接收！\n\n\n\u003Cdetails>\n  \u003Csummary>往期新闻\u003C\u002Fsummary>\n\n- [2025年5月] QAT现已集成到[Axolotl](https:\u002F\u002Fgithub.com\u002Faxolotl-ai-cloud\u002Faxolotl)中，用于微调（[文档](https:\u002F\u002Fdocs.axolotl.ai\u002Fdocs\u002Fqat.html)）！\n- [2025年4月] Float8按行训练在2k H100 GPU规模下[实现了**1.34-1.43倍训练加速**](https:\u002F\u002Fpytorch.org\u002Fblog\u002Faccelerating-large-scale-training-and-convergence-with-pytorch-float8-rowwise-on-crusoe-2k-h200s\u002F)。\n- [2025年4月] TorchAO被添加为[vLLM的量化后端](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Ffeatures\u002Fquantization\u002Ftorchao.html)（[文档](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Ffeatures\u002Fquantization\u002Ftorchao.html)）！\n- [2025年3月] 我们的[2:4稀疏性论文](https:\u002F\u002Fopenreview.net\u002Fpdf?id=O5feVk7p6Y)已被ICLR 2025的SLLM接收！\n- [2025年1月] 我们与[GemLite和SGLang的集成](https:\u002F\u002Fpytorch.org\u002Fblog\u002Faccelerating-llm-inference\u002F)在不同批大小和张量并行规模下，使用int4和float8量化使推理速度提升了1.1-2倍。\n- [2025年1月] 我们新增了针对线性和嵌入操作的[1-8位ARM CPU内核](https:\u002F\u002Fpytorch.org\u002Fblog\u002Fhi-po-low-bit-operators\u002F)。\n- [2024年11月] 我们在Llama-3.1-70B和405B上使用float8训练[实现了**1.43-1.51倍更快的预训练**](https:\u002F\u002Fpytorch.org\u002Fblog\u002Ftraining-using-float8-fsdp2\u002F)。\n- [2024年10月] TorchAO被添加为HF Transformers的量化后端！\n- [2024年9月] 我们正式发布了TorchAO。请查看我们的博客[这里](https:\u002F\u002Fpytorch.org\u002Fblog\u002Fpytorch-native-architecture-optimization\u002F)！\n- [2024年7月] QAT在Llama-3-8B上的量化过程中，[成功恢复了**高达96%的精度损失**](https:\u002F\u002Fpytorch.org\u002Fblog\u002Fquantization-aware-training\u002F)。\n- [2024年6月] 半结构化2:4稀疏性分别在SAM和ViT模型上[实现了**1.1倍推理加速**和**1.3倍训练加速**](https:\u002F\u002Fpytorch.org\u002Fblog\u002Faccelerating-neural-network-training\u002F)。\n- [2024年6月] 块稀疏性在ViT模型上[实现了**1.46倍训练加速**](https:\u002F\u002Fpytorch.org\u002Fblog\u002Fspeeding-up-vits\u002F)，且精度仅下降不到2%。\n\n\u003C\u002Fdetails>\n\n\n## 🌅 概述\n\nTorchAO是一个易于使用的原生PyTorch量化库。TorchAO可与`torch.compile()`和`FSDP2`无缝配合，适用于大多数HuggingFace PyTorch模型。\n\n有关不同硬件和数据类型下稳定及原型工作流的详细概述，请参阅[工作流文档](https:\u002F\u002Fdocs.pytorch.org\u002Fao\u002Fmain\u002Fworkflows.html)。\n\n更多详情请查看我们的[文档](https:\u002F\u002Fdocs.pytorch.org\u002Fao\u002Fmain\u002F)！\n\n## 🚀 快速入门\n\n首先，安装TorchAO。我们建议安装最新的稳定版本：\n```bash\npip install torchao\n```\n\n将您的模型权重量化为int4！\n```python\nimport torch\nfrom torchao.quantization import Int4WeightOnlyConfig, quantize_\nif torch.cuda.is_available():\n  # 在CUDA上量化\n  quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format=\"tile_packed_to_4d\", int4_choose_qparams_algorithm=\"hqq\"))\nelif torch.xpu.is_available():\n  # 在XPU上量化\n  quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format=\"plain_int32\"))\n```\n更多详情请参阅我们的[快速入门指南](https:\u002F\u002Fdocs.pytorch.org\u002Fao\u002Fstable\u002Fquick_start.html)。\n\n## 🛠 安装\n\n要安装最新稳定版本：\n```bash\npip install torchao\n```\n\n\u003Cdetails>\n  \u003Csummary>其他安装选项\u003C\u002Fsummary>\n\n  ```\n  # 夜间版\n  pip install --pre torchao --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fnightly\u002Fcu128\n\n  # 不同CUDA版本\n  pip install torchao --index-url 
https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu126  # CUDA 12.6\n  pip install torchao --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu129  # CUDA 12.9\n  pip install torchao --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fxpu    # XPU\n  pip install torchao --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcpu    # 仅CPU\n\n  # 针对开发者\n  # 注意：需要使用`--no-build-isolation`标志。\n  USE_CUDA=1 pip install -e . --no-build-isolation\n  USE_XPU=1 pip install -e . --no-build-isolation\n  USE_CPP=0 pip install -e . --no-build-isolation\n  ```\n\n\u003C\u002Fdetails>\n\n请参阅[TorchAO兼容性表](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fissues\u002F2919)以了解依赖项的版本要求。\n\n### 可选依赖\n\n[MSLK](https:\u002F\u002Fgithub.com\u002Fpytorch\u002FMSLK)是TorchAO部分工作流中提供加速内核的可选运行时依赖。稳定的MSLK应与稳定的TorchAO搭配使用，而夜间版MSLK则应与夜间版TorchAO搭配使用。\n```bash\n# 稳定版\npip install mslk-cuda==1.0.0\n\n# 夜间版\npip install --pre mslk --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fnightly\u002Fcu128\n```\n\n## 🔎 推理\n\nTorchAO 仅需少量代码改动即可带来显著的性能提升：\n\n- **Int4 权重量化**：在 H100 上，Gemma3-12b-it 的推理速度提升 1.73 倍，内存占用减少 65%，且对精度影响较小。\n- **Float8 动态量化**：在 H100 上，Gemma3-27b-it 的推理速度提升 1.5–1.6 倍；Flux.1-Dev* 和 CogVideoX-5b 的推理速度分别提升 1.54 倍和 1.27 倍，同时保持模型质量不变。\n- **Int8 激活量化与 Int4 权重量化**：通过 ExecuTorch，在 iPhone 15 Pro 上，量化后的 Qwen3-4B 模型可达到 14.8 tokens\u002Fs 的吞吐量，内存占用仅为 3379 MB。\n- **Int4 + 2:4 稀疏化**：在 Llama-3-8B 上，吞吐量提升 2.37 倍，内存占用减少 67.7%。\n\n以下是我们的量化与部署推荐流程：\n```python\nfrom transformers import TorchAoConfig, AutoModelForCausalLM\nfrom torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow\n\n# 创建量化配置\nquantization_config = TorchAoConfig(quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))\n\n# 加载并自动量化\nquantized_model = AutoModelForCausalLM.from_pretrained(\n    \"Qwen\u002FQwen3-32B\",\n    dtype=\"auto\",\n    device_map=\"auto\",\n    quantization_config=quantization_config\n)\n```\n\n如果上述方法不适用，也可使用 [快速入门指南](https:\u002F\u002Fdocs.pytorch.org\u002Fao\u002Fmain\u002Fquick_start.html) 中的 `quantize_` API 进行量化。\n\n使用 vLLM 在单台 H100 机器上进行服务：\n```shell\n# 服务器端\nVLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch\u002FQwen3-32B-FP8 --tokenizer Qwen\u002FQwen3-32B -O3\n```\n\n```shell\n# 客户端\ncurl http:\u002F\u002Flocalhost:8000\u002Fv1\u002Fchat\u002Fcompletions -H \"Content-Type: application\u002Fjson\" -d '{\n  \"model\": \"pytorch\u002FQwen3-32B-FP8\",\n  \"messages\": [\n    {\"role\": \"user\", \"content\": \"请简要介绍一下大型语言模型。\"}\n  ],\n  \"temperature\": 0.6,\n  \"top_p\": 0.95,\n  \"top_k\": 20,\n  \"max_tokens\": 32768\n}'\n```\n\n对于扩散模型，可以使用 Hugging Face diffusers 进行量化：\n```python\nimport torch\nfrom diffusers import DiffusionPipeline, PipelineQuantizationConfig, TorchAoConfig\nfrom torchao.quantization import Int8WeightOnlyConfig\nfrom torchao.quantization.granularity import PerGroup\n\npipeline_quant_config = PipelineQuantizationConfig(\n    quant_mapping={\"transformer\": TorchAoConfig(Int8WeightOnlyConfig(granularity=PerGroup(128)))}\n)\npipeline = DiffusionPipeline.from_pretrained(\n    \"black-forest-labs\u002FFLUX.1-dev\",\n    quantization_config=pipeline_quant_config,\n    torch_dtype=torch.bfloat16,\n    device_map=\"cuda\"\n)\n```\n\n我们还支持通过 ExecuTorch 将模型部署到边缘设备上，详情请参阅 [量化与推理指南](https:\u002F\u002Fdocs.pytorch.org\u002Fao\u002Fmain\u002Fserving.html)。此外，我们还在 [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fpytorch) 上发布了预量化模型。\n\n## 🚅 训练\n\n### 
量化感知训练\n\n后训练量化能够生成快速且紧凑的模型，但也可能导致精度下降。我们建议探索量化感知训练（QAT）以克服这一限制，尤其是在使用较低位宽数据类型（如 int4）时。与 [TorchTune](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtune\u002Fblob\u002Fmain\u002Frecipes\u002Fquantization.md#quantization-aware-training-qat) 合作，我们开发了一套 QAT 流程，相比传统 PTQ 显著提升了模型精度：与 PTQ 相比，Llama3 在 hellaswag 数据集上的准确率恢复了 96%，在 wikitext 数据集上的困惑度下降恢复了 68%。更多细节请参阅 [QAT README](torchao\u002Fquantization\u002Fqat\u002FREADME.md) 和 [官方博客](https:\u002F\u002Fpytorch.org\u002Fblog\u002Fquantization-aware-training\u002F)：\n\n```python\nimport torch\nfrom torchao.quantization import quantize_, Int8DynamicActivationIntxWeightConfig, PerGroup\nfrom torchao.quantization.qat import QATConfig\n\n# 准备阶段\nbase_config = Int8DynamicActivationIntxWeightConfig(\n    weight_dtype=torch.int4,\n    weight_granularity=PerGroup(32),\n)\nquantize_(my_model, QATConfig(base_config, step=\"prepare\"))\n\n# 训练模型（未展示）\n\n# 转换阶段\nquantize_(my_model, QATConfig(base_config, step=\"convert\"))\n```\n\n用户还可以结合 LoRA 和 QAT，利用此 [微调配方](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtune\u002Fblob\u002Fmain\u002Frecipes\u002Fqat_lora_finetune_distributed.py)，使训练速度相比纯 QAT 提升 1.89 倍。\n\n### 量化训练\n\n[torchao.float8](torchao\u002Ffloat8) 实现了基于缩放浮点8位数据类型的训练流程，相关理论见 https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.05433。配合 `torch.compile` 使用时，当前结果表明，在最多 512 张 GPU、参数规模达 405B 的场景下，吞吐量最高可提升 1.5 倍（详情请参阅 [PyTorch 博客](https:\u002F\u002Fpytorch.org\u002Fblog\u002Ftraining-using-float8-fsdp2\u002F)）：\n\n```python\nfrom torchao.float8 import convert_to_float8_training\nconvert_to_float8_training(m)\n```\n\n我们的 float8 训练已集成到 [TorchTitan 的预训练流程](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fblob\u002Fmain\u002Fdocs\u002Ffloat8.md)，方便用户直接尝试。更多详情请参阅以下关于 float8 训练支持的博客文章：\n* [使用 PyTorch Float8 行级量化加速大规模训练与收敛——基于 Crusoe 2K H200 集群](https:\u002F\u002Fpytorch.org\u002Fblog\u002Faccelerating-large-scale-training-and-convergence-with-pytorch-float8-rowwise-on-crusoe-2k-h200s\u002F)\n* [利用 float8 和 FSDP2 超级加速训练](https:\u002F\u002Fpytorch.org\u002Fblog\u002Ftraining-using-float8-fsdp2\u002F)\n* [使用 TorchTitan 在 Amazon SageMaker 上高效预训练 Llama 3 类似架构模型](https:\u002F\u002Faws.amazon.com\u002Fblogs\u002Fmachine-learning\u002Fefficient-pre-training-of-llama-3-like-model-architectures-using-torchtitan-on-amazon-sagemaker\u002F)\n* [PyTorch 中的 Float8](https:\u002F\u002Fdev-discuss.pytorch.org\u002Ft\u002Ffloat8-in-pytorch-1-x\u002F1815)\n\n\u003Cdetails>\n  \u003Csummary>其他特性（稀疏训练、内存高效优化器）\u003C\u002Fsummary>\n\n### 稀疏训练\n\n我们新增了半结构化 2:4 稀疏化的支持，在 ViT-L 模型上实现了 6% 的端到端速度提升。完整博文请参阅 [PyTorch 博客](https:\u002F\u002Fpytorch.org\u002Fblog\u002Faccelerating-neural-network-training\u002F)。代码修改仅需一行，完整示例可在 [torchao\u002Fsparsity\u002Ftraining\u002F](torchao\u002Fsparsity\u002Ftraining\u002F) 查看：\n\n```python\nfrom torchao.sparsity.training import SemiSparseLinear, swap_linear_with_semi_sparse_linear\nswap_linear_with_semi_sparse_linear(model, {\"seq.0\": SemiSparseLinear})\n```\n\n### 内存高效的优化器\n\n像 Adam 这样的优化器可能会占用大量的 GPU 显存——甚至达到模型参数本身的两倍。TorchAO 提供了两种方法来降低这种开销：\n\n**1. 量化优化器**：通过量化到较低精度，将优化器状态的内存占用减少 2 到 4 倍。\n\n```python\nfrom torchao.optim import AdamW8bit, AdamW4bit, AdamWFp8\noptim = AdamW8bit(model.parameters()) # 替换为 AdamW4bit 或 AdamWFp8 以使用 4 位 \u002F FP8 版本\n```\n我们的量化优化器仅用几百行 PyTorch 代码实现，并经过编译以提高效率。虽然速度略慢于专用内核，但它们在节省内存和性能之间提供了极佳的平衡。详细基准测试请参见 [这里](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Ftree\u002Fmain\u002Ftorchao\u002Foptim)。\n\n**2. 
CPU 卸载**：将优化器状态和梯度转移到 CPU 内存中。\n\n为了最大限度地节省显存，我们支持 [单 GPU CPU 卸载](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Ftree\u002Fmain\u002Ftorchao\u002Foptim#optimizer-cpu-offload)，该方法可以高效地将梯度和优化器状态转移到 CPU 内存。这种方法可以在对训练速度影响很小的情况下，**将你的显存需求降低 60%**：\n\n```python\nfrom torchao.optim import CPUOffloadOptimizer\n\noptim = CPUOffloadOptimizer(model.parameters(), torch.optim.AdamW, fused=True)\n# 可选：从已有检查点字典恢复优化器状态\noptim.load_state_dict(ckpt[\"optim\"])\n```\n\n\u003C\u002Fdetails>\n\n\u003C!--\n## 针对开发者\n\n### 可组合性\n`torch.compile`：对我们来说，一个关键的设计原则是可组合性——任何自定义的数据类型或内存布局都应能与我们的编译器配合使用。我们支持在 PyTorch、CUDA、C++ 或 Triton 中实现内核。这使得研究人员和工程师可以从纯 PyTorch 中的高级数据类型和布局逻辑开始，然后根据需要逐步实现低级内核以优化性能，同时保持与编译基础设施的兼容性。\n\n[FSDP2](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fblob\u002Fmain\u002Fdocs\u002Ffsdp.md)：历史上，大多数量化都是为推理任务进行的，而现在结合分布式算法和量化的研究领域正蓬勃发展。\n我们将低比特数据类型与编译和 FSDP 结合的最佳例子就是 [NF4](torchao\u002Fquantization\u002Fquantize_\u002Fworkflows\u002Fnf4\u002Fnf4_tensor.py)，我们曾用它来实现 [QLoRA](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=UvRl4ansfCg) 算法。因此，如果你正在这一领域的交叉研究中工作，我们非常乐意与你交流。\n\n我们的框架使你能够轻松地为自定义的量化张量子类添加张量并行支持。请查看我们的 [张量并行教程](tutorials\u002Fdeveloper_api_guide\u002Ftensor_parallel.py)，了解如何扩展量化张量子类以支持按列和按行的张量分片，同时保持与 `torch.compile` 的兼容性。\n\n-->\n\n## 🔗 集成\n\nTorchAO 已集成到一些领先的开源库中，包括：\n\n* Unsloth 现已支持 QAT：[阅读博客](https:\u002F\u002Fdocs.unsloth.ai\u002Fnew\u002Fquantization-aware-training-qat) 和 [指南](https:\u002F\u002Fdocs.unsloth.ai\u002Fnew\u002Fquantization-aware-training-qat#qat--lora-finetuning)。\n* HuggingFace Transformers 拥有 [内置推理后端](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain\u002Fquantization\u002Ftorchao) 和 [低比特优化器](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Fpull\u002F31865)。\n* HuggingFace [diffusers](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdiffusers\u002Fmain\u002Fen\u002Fquantization\u002Ftorchao) 最佳实践，结合 `torch.compile` 和 TorchAO，位于独立仓库 [diffusers-torchao](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers\u002Fblob\u002Fmain\u002Fdocs\u002Fsource\u002Fen\u002Fquantization\u002Ftorchao.md) 中。\n* vLLM 用于 LLM 服务：[使用说明](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Ffeatures\u002Fquantization\u002Ftorchao.html)、[详细文档](https:\u002F\u002Fdocs.pytorch.org\u002Fao\u002Fmain\u002Ftorchao_vllm_integration.html)。\n* 与 [MSLK](https:\u002F\u002Fgithub.com\u002Fmeta-pytorch\u002FMSLK) 集成，用于服务器 GPU 上的 SOTA 内核。\n* 与 [ExecuTorch](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fexecutorch\u002F) 集成，用于边缘设备部署。\n* Axolotl 支持 [QAT](https:\u002F\u002Fdocs.axolotl.ai\u002Fdocs\u002Fqat.html) 和 [PTQ](https:\u002F\u002Fdocs.axolotl.ai\u002Fdocs\u002Fquantize.html)。\n* TorchTitan 用于 [float8 预训练](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fblob\u002Fmain\u002Fdocs\u002Ffloat8.md)。\n* HuggingFace PEFT 使用 TorchAO 作为其 [量化后端](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fpeft\u002Fen\u002Fdeveloper_guides\u002Fquantization#torchao-pytorch-architecture-optimization)，用于 LoRA。\n* TorchTune 提供我们的 NF4 [QLoRA](https:\u002F\u002Fdocs.pytorch.org\u002Ftorchtune\u002Fmain\u002Ftutorials\u002Fqlora_finetune.html)、[QAT](https:\u002F\u002Fdocs.pytorch.org\u002Ftorchtune\u002Fmain\u002Frecipes\u002Fqat_distributed.html) 以及 [float8 量化微调](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtune\u002Fpull\u002F2546) 等配方。\n* SGLang 用于 LLM 服务：[使用说明](https:\u002F\u002Fdocs.sglang.ai\u002Fadvanced_features\u002Fquantization.html#online-quantization)。\n\n## 🎥 视频\n* [GPU MODE IRL 主题演讲](https:\u002F\u002Fyoutu.be\u002FFH5wiwOyPX4?si=VZK22hHz25GRzBG1&t=1009)\n* [PyTorch 
大会上关于低精度数据类型的演讲](https:\u002F\u002Fyoutu.be\u002FxcKwEZ77Cps?si=7BS6cXMGgYtFlnrA)\n* [Mastering LLM's 课程中解决 OOM 问题的讲解](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=UvRl4ansfCg)\n* [CUDA MODE 高级量化讲座](https:\u002F\u002Fyoutu.be\u002F1u9xUK3G4VM?si=4JcPlw2w8chPXW8J)\n* [Chip Huyen 的 GPU 优化研讨会](https:\u002F\u002Fwww.youtube.com\u002Flive\u002Fv_q2JTIqE20?si=mf7HeZ63rS-uYpS6)\n* [Cohere for AI 社区演讲](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=lVgrE36ZUw0)\n\n\n## 💬 引用\n\n如果您觉得 torchao 库很有用，请在您的工作中按以下方式引用它。\n\n```bibtex\n@misc{or2025torchao,\n  title={TorchAO: PyTorch-Native Training-to-Serving Model Optimization},\n  author={Andrew Or and Apurva Jain and Daniel Vega-Myhre and Jesse Cai and Charles David Hernandez and Zhenrui Zheng and Driss Guessous and Vasiliy Kuznetsov and Christian Puhrsch and Mark Saroufim and Supriya Rao and Thien Tran and Aleksandar Samardžić},\n  year={2025},\n  eprint={2507.16099},\n  archivePrefix={arXiv},\n  primaryClass={cs.LG},\n  url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.16099},\n}\n```","# TorchAO 快速上手指南\n\nTorchAO 是 PyTorch 原生的模型优化工具库，专注于从训练到推理的全流程量化与加速。它支持 int4、float8 等多种精度，能够显著降低显存占用并提升推理\u002F训练速度，且无需修改模型架构即可与 `torch.compile` 和 HuggingFace 模型无缝集成。\n\n## 环境准备\n\n*   **操作系统**: Linux (推荐), Windows, macOS\n*   **Python**: 3.9 - 3.12\n*   **PyTorch**: 2.4+ (建议使用最新稳定版以获得最佳兼容性)\n*   **硬件**:\n    *   **GPU**: NVIDIA GPU (推荐 Ampere 架构及以上，如 A100\u002FH100\u002FRTX 30\u002F40 系列) 或 Intel XPU。\n    *   **CPU**: 支持 ARM (Apple Silicon) 及 x86 架构的低比特算子。\n*   **前置依赖**: 确保已安装对应 CUDA 版本的 PyTorch。\n\n> **注意**：TorchAO 强依赖 PyTorch 版本。若遇到兼容性问题，请参考 [官方兼容性表](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fissues\u002F2919)。\n\n## 安装步骤\n\n### 1. 安装稳定版（推荐）\n\n直接使用 pip 安装最新稳定版本：\n\n```bash\npip install torchao\n```\n\n### 2. 指定 CUDA 版本安装\n\n如果您的环境需要特定 CUDA 版本，可使用以下命令：\n\n```bash\n# CUDA 12.6\npip install torchao --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu126\n\n# CUDA 12.9\npip install torchao --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu129\n\n# CPU-only 版本\npip install torchao --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcpu\n```\n\n### 3. 开发者安装 (源码编译)\n\n如需开发或体验最新特性（Nightly 版）：\n\n```bash\n# 安装 Nightly 版本\npip install --pre torchao --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fnightly\u002Fcu128\n\n# 或者从源码编译 (需设置 --no-build-isolation)\nUSE_CUDA=1 pip install -e . --no-build-isolation\n```\n\n### 可选依赖：MSLK 加速\n\n[MSLK](https:\u002F\u002Fgithub.com\u002Fpytorch\u002FMSLK) 可为部分工作流提供加速内核：\n\n```bash\n# 稳定版 MSLK\npip install mslk-cuda==1.0.0\n```\n\n## 基本使用\n\nTorchAO 提供了极简的 API 进行模型量化。以下是三种最常用的场景。\n\n### 场景一：快速量化现有模型 (Int4 Weight-Only)\n\n适用于推理加速，可大幅减少显存占用。\n\n```python\nimport torch\nfrom torchao.quantization import Int4WeightOnlyConfig, quantize_\n\n# 假设 model 已经加载并移动到设备\n# model = ... 
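\n# 例如（仅作示意，假设本机已有可用 GPU）：可先用一个玩具模型验证量化流程\n# model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.Linear(1024, 1024)).to(\"cuda\")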
\n\nif torch.cuda.is_available():\n    # CUDA 设备量化配置\n    config = Int4WeightOnlyConfig(\n        group_size=32, \n        int4_packing_format=\"tile_packed_to_4d\", \n        int4_choose_qparams_algorithm=\"hqq\"\n    )\nelif torch.xpu.is_available():\n    # XPU 设备量化配置\n    config = Int4WeightOnlyConfig(\n        group_size=32, \n        int4_packing_format=\"plain_int32\"\n    )\nelse:\n    raise RuntimeError(\"No supported accelerator found\")\n\n# 执行原地量化\nquantize_(model, config)\n\n# 之后可直接用于推理\n# output = model(input_ids)\n```\n\n### 场景二：加载即量化 (HuggingFace Transformers 集成)\n\n通过 `transformers` 库直接加载并量化模型，无需手动修改模型代码。\n\n```python\nfrom transformers import TorchAoConfig, AutoModelForCausalLM\nfrom torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow\n\n# 定义量化配置 (例如 Float8 动态量化)\nquantization_config = TorchAoConfig(\n    quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())\n)\n\n# 加载模型时自动应用量化\nquantized_model = AutoModelForCausalLM.from_pretrained(\n    \"Qwen\u002FQwen3-32B\",       # 替换为目标模型\n    dtype=\"auto\",\n    device_map=\"auto\",\n    quantization_config=quantization_config\n)\n\n# 模型已处于量化状态，可直接生成文本\n```\n\n### 场景三：扩散模型量化 (Diffusers)\n\n针对 Stable Diffusion \u002F Flux 等模型的量化示例：\n\n```python\nimport torch\nfrom diffusers import DiffusionPipeline, PipelineQuantizationConfig, TorchAoConfig\nfrom torchao.quantization import Int8WeightOnlyConfig\nfrom torchao.quantization.granularity import PerGroup\n\n# 配置 Transformer 部分的量化策略\npipeline_quant_config = PipelineQuantizationConfig(\n    quant_mapping={\"transformer\": TorchAoConfig(Int8WeightOnlyConfig(granularity=PerGroup(128)))}\n)\n\npipeline = DiffusionPipeline.from_pretrained(\n    \"black-forest-labs\u002FFLUX.1-dev\",\n    quantization_config=pipeline_quant_config,\n    torch_dtype=torch.bfloat16,\n    device_map=\"cuda\"\n)\n```","某初创团队正在基于 Llama-3-8B 模型开发一款实时法律问答助手，需要在有限的 GPU 资源下同时满足快速微调训练和低延迟上线的需求。\n\n### 没有 ao 时\n- **训练效率低下**：使用传统浮点精度进行全量微调，显存占用极高，导致无法在单卡上运行大批次训练，迭代周期长达数天。\n- **推理成本高昂**：部署时模型体积庞大，显存占用超过 16GB，迫使团队租用昂贵的多卡实例，且首字延迟难以控制在 200ms 以内。\n- **精度与速度难兼得**：尝试手动量化至 int4 后，模型在法律术语理解上出现严重幻觉，准确率下降超过 15%，被迫回退到高精度模式。\n- **集成流程繁琐**：需要编写大量自定义算子来适配不同的量化后端，维护成本高且容易引入兼容性 bug。\n\n### 使用 ao 后\n- **训练大幅加速**：利用 ao 的 float8 原生训练支持，在保持损失曲线一致的前提下，预训练和微调速度提升 1.5 倍，显著缩短研发周期。\n- **推理极致优化**：通过 ao 将模型量化为 int4 格式，显存占用减少 58%，推理速度提升 1.89 倍，成功在单张消费级显卡上实现低延迟部署。\n- **精度完美恢复**：借助 ao 集成的量化感知训练（QAT）技术，找回了 67% 因量化导致的精度损失，确保法律回答的专业性和准确性。\n- **生态无缝衔接**：ao 作为原生后端直接融入 Hugging Face Transformers 和 vLLM 框架，无需修改核心代码即可一键切换量化策略。\n\nao 通过 PyTorch 
原生的量化与稀疏化能力，让团队在低成本硬件上实现了从高效训练到高性能推理的全链路优化。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fpytorch_ao_95ad2974.png","pytorch","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fpytorch_be722ba8.jpg","",null,"https:\u002F\u002Fpytorch.org","https:\u002F\u002Fgithub.com\u002Fpytorch",[79,83,87,90,94,98,102,105,109,112],{"name":80,"color":81,"percentage":82},"Python","#3572A5",83,{"name":84,"color":85,"percentage":86},"C++","#f34b7d",12.9,{"name":88,"color":89,"percentage":32},"Cuda","#3A4E3A",{"name":91,"color":92,"percentage":93},"Metal","#8f14e9",0.8,{"name":95,"color":96,"percentage":97},"Shell","#89e051",0.6,{"name":99,"color":100,"percentage":101},"CMake","#DA3434",0.3,{"name":103,"color":104,"percentage":101},"Objective-C++","#6866fb",{"name":106,"color":107,"percentage":108},"Batchfile","#C1F12E",0,{"name":110,"color":111,"percentage":108},"C","#555555",{"name":113,"color":114,"percentage":108},"Makefile","#427819",2786,490,"2026-04-17T17:57:31","NOASSERTION","Linux","NVIDIA GPU 必需 (支持 CUDA 12.6, 12.8, 12.9); 可选 Intel XPU; 支持 ARM CPU (边缘设备)。具体显存取决于模型大小，示例中提到 H100、B200、H200 及 iPhone 15 Pro。","未说明",{"notes":123,"python":121,"dependencies":124},"该工具主要面向 PyTorch 原生环境，深度集成 torch.compile 和 FSDP2。安装时需注意 CUDA 版本匹配（提供 cu126, cu128, cu129 等特定索引）。支持多种量化格式（int4, int8, float8）及稀疏化训练\u002F推理。可选依赖 MSLK 用于加速部分工作流。支持通过 ExecuTorch 部署到移动端。开发版安装需使用 --no-build-isolation 标志。",[125,126,127,128,129],"torch","transformers","vllm","diffusers","mslk-cuda (可选)",[14,35],[132,133,134,135,72,136,137,138,139,140,141,142],"brrr","dtypes","inference","mx","quantization","sparsity","training","float8","transformer","cuda","llama","2026-03-27T02:49:30.150509","2026-04-18T22:35:23.980399",[146,151,156,161,166,170],{"id":147,"question_zh":148,"answer_zh":149,"source_url":150},40487,"在分布式训练（FSDP）中启用 BF16 随机舍入（stochastic rounding）时报错怎么办？","这是一个已知问题，通常与 PyTorch 核心库有关。如果急需修复，可以手动应用以下补丁到 `torchao\u002Foptim\u002Fquant_utils.py` 文件中，以兼容不同版本的 DTensor 导入路径：\n\n```diff\n--- torchao\u002Foptim\u002Fquant_utils.py\n+++ torchao\u002Foptim\u002Fquant_utils.py\n@@ -5,6 +5,13 @@\n import torch\n from torch import Tensor\n+try:\n+    from torch.distributed.tensor import DTensor\n+except Exception:\n+    try:\n+        from torch.distributed._tensor import DTensor\n+    except Exception:\n+        DTensor = tuple()\n \n # ... (后续代码保持不变，注意函数参数名可能需要微调以避免冲突)\n```\n此外，建议关注 PyTorch 主仓库的相关 Issue（如 #156649），官方可能很快会修复此问题。","https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fissues\u002F2296",{"id":152,"question_zh":153,"answer_zh":154,"source_url":155},40488,"TorchAO 是否支持静态量化（Static Quantization）以减少推理时的动态开销？","TorchAO 主要关注动态量化和权重量化。对于静态量化需求：\n1. 如果您针对的是 x86 CPU 或边缘运行时，可以使用基于 PT2 export 的全图捕获量化流程（参考教程：pt2e_quant_ptq_x86_inductor）。\n2. 
如果您需要 CUDA 后端的静态量化，目前社区正在讨论中，建议查看相关 RFC 或 Issue 以获取最新进展。您可以明确向维护者说明您需要的具体算子（如 conv\u002Flinear）和目标后端。","https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fissues\u002F47",{"id":157,"question_zh":158,"answer_zh":159,"source_url":160},40489,"为什么将 int4 权重量化的模型从 CPU 移动到 CUDA 后输出乱码？","这是因为 CPU 和 CUDA 后端的 int4 打包格式（packing format）在数值上是不同的。直接调用 `.cuda()` 不会自动重新打包权重，导致数据解释错误。\n\n**解决方案：**\n避免先量化再移动设备。正确的做法是先将模型移动到目标设备，然后再进行量化。例如：\n```python\n# 错误做法\nquantize_(model.cpu(), int4_weight_only(group_size=groupsize))\nmodel.cuda() # 输出会乱码\n\n# 正确做法\nmodel.cuda()\nquantize_(model, int4_weight_only(group_size=groupsize))\n```\n目前库中尚缺乏自动检测打包来源并重新打包的机制，因此必须严格遵守“先移设备，后量化”的顺序。","https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fissues\u002F1117",{"id":162,"question_zh":163,"answer_zh":164,"source_url":165},40490,"如何在 ROCm (AMD GPU) 上使用 torchao.float8 进行训练？","截至目前，`torchao.float8` 模块在 ROCm  nightly 版本中尚未完全支持或直接包含。如果您尝试导入发现模块缺失，这属于预期行为。\n建议方案：\n1. 检查是否有特定的 ROCm 分支或单独的库支持 float8。\n2. 关注官方关于 ROCm 支持时间表的更新。\n3. 暂时可能无法在 ROCm 上直接使用 `torchao.float8` 进行训练，需等待官方适配或使用其他替代方案。","https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fissues\u002F1066",{"id":167,"question_zh":168,"answer_zh":169,"source_url":155},40491,"TorchAO 的主要发展方向和支持的量化技术有哪些？","TorchAO 的战略重点是使用原生 PyTorch 特性加速生成式 AI 模型，并确保与 `torch.compile` 的可组合性。主要规划包括：\n1. **核心技术**：提供 LLM 和其他 GenAI 模型最重要的量化技术，如 GPTQ、AWQ、int8 动态量化和仅权重量化（int8\u002Fint4）。\n2. **内核优化**：维护一套高性能的 CPU\u002FGPU 内核，并持续跟进 SOTA 技术。\n3. **非标准数据类型**：通过 Tensor 子类支持 nf4、any4、mx4 等非标准 dtype。\n4. **易用性**：遵循 PyTorch 设计原则，提供简单的 API 和安装流程。\n最新的 supported workflows 请以项目 README 为准。",{"id":171,"question_zh":172,"answer_zh":173,"source_url":150},40492,"在使用 torchtune 配置 LLAMA3 模型时，如何正确设置优化器以启用 8bit AdamW 和 BF16 随机舍入？","在 torchtune 的 YAML 配置文件中，您需要按以下方式配置优化器组件：\n```yaml\noptimizer:\n  _component_: torchao.optim.AdamW8bit\n  bf16_stochastic_round: true  # 启用 BF16 随机舍入\n  lr: 4.0e-05\n```\n同时确保 `dtype` 设置为 `bf16`。注意：如果在分布式环境（FSDP）下遇到报错，请参考关于分布式 BF16 随机舍入的修复补丁（见 Issue #2296）。",[175,180,185,190,195,200,205,210,215,220,225,230,235,240,245,250,255],{"id":176,"version":177,"summary_zh":178,"released_at":179},323922,"v0.17.0","## 亮点\n\n我们很高兴地宣布 torchao 0.17 版本的发布！此版本新增了对 cuteDSL MXFP8 MoE 核心的支持、按头进行 FP8 量化低精度注意力机制、ABI 稳定性等功能！\n\n### CuteDSL MXFP8 MoE 核心\n\n我们为 3D 专家权重添加了一个新的 CuteDSL MXFP8 量化核心，该核心会将缩放因子直接写入张量核心的分块布局中：[https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F4090](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F4090)\n\n* 用于在使用分组 GEMM 的 MoE 训练的反向传播过程中沿 dim1 维度进行缩放。  \n* 相比之前的“先量化再缩放布局转换”两步法，速度提升了约 12%！\n\n### 按头 FP8 量化低精度注意力机制\n\n我们新增了一个以 FA3 为后端的按头 FP8 量化注意力 API（[https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F3959](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F3959) 和 [https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F3857](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F3857)）。\n\n* 用户可以选择直接用基础模块替换 `F.scaled_dot_product_attention`，也可以使用高级封装器，它会将模块内的所有 F.SDPA 调用替换为低精度注意力变体。  \n* 对封装后的模块调用 `torch.compile` 将会在适当的地方启用 RoPE 融合。  \n* 实验结果显示，在 Wan2.1-T2V-1.3B 上加速倍数为 **1.84x**，在 LLaMA 3 预填充阶段（高序列长度，131k）上加速倍数为 **1.23x**，而在 flux.1-schnell 上（图像尺寸为 2048×2048）则加速倍数为 **1.07x**。\n\n直接替换的使用示例：\n\n```py\nfrom torchao.prototype.attention.fp8_fa3 import fp8_fa3_sdpa, fp8_fa3_rope_sdpa\nout = fp8_fa3_sdpa(q, k, v)\n```\n\n封装器的使用示例：\n\n```py\nfrom torchao.prototype.attention import (\n    AttentionBackend,\n    LowPrecisionAttentionConfig,\n    apply_low_precision_attention,\n)\n# 实例化任意 nn.Module()\nmodel = 
MyModel()\n\n# 简单的 SDPA 替换\nconfig = LowPrecisionAttentionConfig(backend=AttentionBackend.FP8_FA3)\nmodel = apply_low_precision_attention(model, config)\n\n# Flash 激活由封装器内部处理\noutput = model(inputs)\n\n# 调用 torch.compile 将启用 RoPE 融合\nmodel = torch.compile(model)\n```\n\n### PyTorch ABI 稳定性\n\n根据 [https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fissues\u002F3516](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fissues\u002F3516)，torchao 现在已实现所有 CUDA 核心的 ABI 稳定性！这意味着如果用户运行的是 PyTorch 2.11 或更高版本，他们无需为每个新的 torchao 版本升级 PyTorch，即可直接使用 torchao 的 CUDA 核心。这一稳定性适用于当前及未来的所有 torchao 版本（0.17.0 及以上）。需要注意的是，仅 Python API 的兼容性与之前相同：我们支持最新的 3 个 PyTorch 次要版本。\n\n此前：\n\n```py\n# C++ 扩展的兼容版本：\ntorchao 0.16.0 + PyTorch 2.10.0\ntorchao 0.15.0 + PyTorch 2.9.1\ntorchao 0.14.1 + PyTorch 2.9.0\n\n# 仅 Python API 的兼容版本\n# 最近的 3 个 PyTorch 版本：\ntorchao 0.16.0 + PyTorch 2.10.0、2.9.1、2.8.0\ntorc","2026-03-30T22:38:47",{"id":181,"version":182,"summary_zh":183,"released_at":184},323923,"v0.16.0","## 亮点\n\n我们很高兴地宣布 torchao 0.16.0 版本的发布！此版本新增了对 MXFP8 MoE 构建模块的支持，用于实现专家并行训练，并弃用了部分配置的旧版本以及较少使用的量化选项，以使 torchao 更加精简！此外，我们还重新设计了 [文档页面](https:\u002F\u002Fdocs.pytorch.org\u002Fao\u002Fmain\u002F) 和 [README](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fblob\u002Fmain\u002FREADME.md)，并在实现 torchao 的 [ABI 稳定性](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fissues\u002F3516)方面取得了一些进展。\n\n### MXFP8 MoE 构建模块：用于专家并行训练\n\n此版本包含以下适用于 MXFP8 MoE 训练（采用专家并行通信）的可微分构建模块：\n\n* [a2a_dispatch_mxfp8_fwd_hp_bwd](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fblob\u002Feb39a84e6ef89de70f0734877eaa521e0ac39d43\u002Ftorchao\u002Fprototype\u002Fmoe_training\u002Fep\u002Fa2a_dispatch.py#L143)：全互连令牌调度（MXFP8 前向传播，BF16 反向传播）  \n* [permute_mxfp8_fwd_hp_bwd](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fblob\u002Feb39a84e6ef89de70f0734877eaa521e0ac39d43\u002Ftorchao\u002Fprototype\u002Fmoe_training\u002Fep\u002Fpermute.py#L212)：为 MXFP8 计算对令牌进行打乱与填充（MXFP8 前向传播，BF16 反向传播）  \n* [\\_to_mxfp8_then_scaled_grouped_mm](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fblob\u002Feb39a84e6ef89de70f0734877eaa521e0ac39d43\u002Ftorchao\u002Fprototype\u002Fmoe_training\u002Fscaled_grouped_mm.py#L1006)：用于路由专家计算的 MXFP8 分组 GEMM（**新增**：可选择接受预量化输入）。输出为 bfloat16 格式。  \n* [unpermute_hp_fwd_mxfp8_bwd](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fblob\u002Feb39a84e6ef89de70f0734877eaa521e0ac39d43\u002Ftorchao\u002Fprototype\u002Fmoe_training\u002Fep\u002Funpermute.py#L115)：将令牌恢复到原始顺序（BF16 前向传播，MXFP8 反向传播）  \n* [a2a_combine_hp_fwd_mxfp8_bwd](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fblob\u002Feb39a84e6ef89de70f0734877eaa521e0ac39d43\u002Ftorchao\u002Fprototype\u002Fmoe_training\u002Fep\u002Fa2a_combine.py#L161)：全互连令牌合并（BF16 前向传播，MXFP8 反向传播）。请注意，此处并未执行实际的合并或聚合操作，命名仅表明其用途是紧接在聚合操作之前的全互连步骤。\n\n这些自动求导函数可以串联使用，从而实现基于专家并行通信和 MXFP8 分组 GEMM 的高效 MoE 训练。\n\n采用该方法后，DeepSeekV3 16B 模型的训练速度可提升 10%–25% 每秒处理的令牌数：\n\n* 在单节点 8 张 B200 显卡的集群中，借助 NVLink 节点内互联进行设备间通信时，每秒处理的令牌数提升 10%。  \n* 在多节点 B200 集群中，结合 IB 节点间互联和 NVLink 节点内互联时，每秒处理的令牌数提升 25%。  \n\n## 已弃用功能\n\n* 废弃 `Float8WeightOnlyConfig`、`Float8DynamicActivationFloat8WeightConfig`、`Int8DynamicActivationIntxWeightConfig`、`IntxWeightOnlyConfig` 和 `Int4WeightOnlyConfig` 的 v1 
\n## Deprecations\n\n* Deprecate the v1 versions of `Float8WeightOnlyConfig`, `Float8DynamicActivationFloat8WeightConfig`, `Int8DynamicActivationIntxWeightConfig`, `IntxWeightOnlyConfig`, and `Int4WeightOnlyConfig` ([https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F3510](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F3510), [https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F3511](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F3511), [https:\u002F\u002Fgithub.com\u002Fp","2026-02-10T23:49:42",{"id":186,"version":187,"summary_zh":188,"released_at":189},323924,"v0.15.0","## Highlights\n\nWe are excited to announce the torchao 0.15.0 release! This update adds the following:\n\n- MXFP8 MoE training achieves a 1.2x end-to-end training speedup over bf16, with fully matching convergence, when training Llama4 Scout on a 64-node GB200 Crusoe cluster!\n- MXFP8 MoE kernels now ship with torchao's CUDA 12.8+ builds (just pip install, no building from source!)\n- Safetensors support\n- Introduced parameter-level quantization targeting\n\n### MXFP8 MoE training achieves a 1.2x end-to-end speedup over bf16, with fully matching convergence, when training Llama4 Scout on a 64-node GB200 Crusoe cluster\n\nA TorchTitan Llama4 Scout training job on a 64-node GB200 cluster achieved a 1.2x end-to-end speedup over the bfloat16 baseline training, with comparable convergence. In fact, after 3,000 training steps the loss was even slightly lower than bfloat16! This is consistent with our earlier scaling experiments on [MXFP8 training for dense models](https:\u002F\u002Fpytorch.org\u002Fblog\u002Faccelerating-2k-scale-pre-training-up-to-1-28x-with-torchao-mxfp8-and-torchtitan-on-crusoe-b200-cluster\u002F).\n\n\u003Cimg width=\"790\" height=\"403\" alt=\"mxfp8_with_loss\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F42d54b96-caa6-4ff7-8bac-16541cd8d7f8\" \u002F>\n\n| GPU count | BF16 tokens\u002Fsec | MXFP8 tokens\u002Fsec | MXFP8 speedup over BF16 |\n| ----------------------- | --------------: | ----------------: | ---------------------: |\n| 512 | 6169 | 7401 | 1.20x |\n\nFor more details, see the [TorchAO MXFP8 MoE training docs](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fblob\u002Fmain\u002Ftorchao\u002Fprototype\u002Fmoe_training\u002FREADME.md). You can also check the [TorchTitan MXFP8 docs](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fblob\u002Fmain\u002Fdocs\u002Fmxfp8.md); adding a single config entry is enough to run a pretraining job with TorchAO MXFP8.\n### Safetensors support\n\nYou can now save and load TorchAO model checkpoints in the safetensors format! This has been integrated into Hugging Face Transformers since v5.0.0 and is supported by vLLM 0.13.0 for model inference and deployment.\n\nWe currently support the following stable configs:\n`Float8DynamicActivationFloat8WeightConfig`\n`Int4WeightOnlyConfig`\n`IntxWeightOnlyConfig`\n`Int8DynamicActivationIntxWeightConfig`\n`Int8WeightOnlyConfig`\n`Int8DynamicActivationInt8WeightConfig`\n\nWe will continue adding support for more configs as they stabilize.\n\n\nExample:\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig\nfrom torchao.quantization import Float8WeightOnlyConfig\n\n\nmodel_id = \"facebook\u002Fopt-125m\"\nquant_config = Float8WeightOnlyConfig()\nquantization_config = TorchAoConfig(quant_type=quant_config)\nquantized_model = AutoModelForCausalLM.from_pretrained(\n   model_id,\n   device_map=\"auto\",\n   torch_dtype=torch.bfloat16,\n   quantization_config=quantization_config,\n)\ntokenizer = AutoTokenizer.from_pretrained(model_id)\n\n\n#### Push to Hub\nMODEL_NAME = model_id.split(\"\u002F\")[-1]\nsave_to = f\"torchao-testing\u002F{MODEL_NAME}-Float8WeightOn","2025-12-22T19:09:43",{"id":191,"version":192,"summary_zh":193,"released_at":194},323925,"v0.14.1","## **Highlights**  \n\nWe are excited to announce the torchao 0.14.1 release! This release adds support for MoE training on Blackwell GPUs and for NVFP4 QAT!  \n\n### **(Prototype) MoE training on Blackwell GPUs**  \n\nWe added a quantized building block for accelerating MoE training on Blackwell GPUs: torchao's `_scaled_grouped_mm`! It is a differentiable drop-in replacement for `torch._grouped_mm` that dynamically quantizes the inputs according to a given quantization recipe, performs a scaled grouped GEMM, and returns the result in the original precision. This brings significant speedups (see the benchmarks below)!  \n\n
```py  \nimport torch  \nfrom torch.nn import functional as F  \nfrom torchao.prototype.moe_training import (  \n    _scaled_grouped_mm as torchao_scaled_grouped_mm  \n)  \nfrom torchao.prototype.moe_training.conversion_utils import MoEScalingType  \nfrom torchao.prototype.moe_training.utils import generate_jagged_offs  \n\nnum_groups, total_M, N, K = 8, 131072, 8192, 5120  \n\n# A = input activations, B = expert weights  \nA = torch.randn(total_M, K, dtype=torch.bfloat16, device=\"cuda\", requires_grad=True)  \nB = torch.randn(num_groups, N, K, dtype=torch.bfloat16, device=\"cuda\", requires_grad=True)  \n\n# Token group offsets computed by the router in a real MoE layer  \noffs = generate_jagged_offs(num_groups, total_M, device=\"cuda\")  \n\n# Example forward and backward pass  \nout = torchao_scaled_grouped_mm(  \n        A,  \n        B.transpose(-2, -1),  \n        offs=offs,  \n        scaling_type=MoEScalingType.MXFP8,  \n)  \nlabels = torch.ones_like(out)  \nloss = F.mse_loss(out, labels)  \nloss.backward()  \n```  \n\nMicrobenchmarks (see the [README](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Ftree\u002Fmain\u002Ftorchao\u002Fprototype\u002Fmoe_training) for the commands to reproduce them):  \n\n* Forward + backward vs torch.\\_grouped\\_mm:  \n  * ~1.4–1.8x speedup at Llama4 17bx16e model sizes  \n  * ~1.2–1.4x speedup at DeepSeekV3 671b model sizes  \n* Full MoE layer forward + backward:  \n  * ~1.4x speedup (Llama4 17bx16e size, batch\\_size=8, seq\\_len=16384)  \n  * ~1.2x speedup (DeepSeekV3 671b size, batch\\_size=8, seq\\_len=16384).  \n\nThis is also integrated into TorchTitan for end-to-end DeepSeekV3 and Llama4 training! Simply pass the command-line flag `--model.converters=\"quantize.grouped_mm.mx\"` to convert all `torch._grouped_mm` ops to torchao's \\_scaled\\_grouped\\_mm ops under the hood:  \n\nTorchTitan end-to-end training benchmarks (see the [README](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Ftree\u002Fmain\u002Ftorchao\u002Fprototype\u002Fmoe_training) for the commands to reproduce them):  \n\n* ~1.4x end-to-end training speedup for a two-layer Llama4 16e model on 4 NVLink-connected B200 GPUs with the dp2ep (no TP) parallelism strategy  \n\n### **(Prototype) NVFP4 QAT ([https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F3050](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F3050))**  \n\nWe added quantization-aware training (QAT) support for NVFP4 as a prototype feature! It is currently only available on Blackwell GPUs:  \n\n```py  \nfrom torchao.quantization import quantize_  \nfrom torchao.prototype.","2025-10-13T21:34:36",{"id":196,"version":197,"summary_zh":198,"released_at":199},323926,"v0.13.0-rc8","## **Highlights**\n\nWe are excited to announce the torchao 0.13.0 release! This release adds several QAT improvements, faster mxfp8 pretraining, and more!\n\n### **Simpler multi-step QAT API (**[https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F2629](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F2629)**)**\n\nWe added a simpler multi-step QAT API that needs only a single config. Users now just specify the target post-training quantization (PTQ) config as the base config, and the correct fake-quantization config is inferred automatically!\n\n```py\nfrom torchao.quantization import (\n    quantize_,\n    Int8DynamicActivationInt4WeightConfig\n)\nfrom torchao.quantization.qat import QATConfig\n\n# Prepare\nbase_config = Int8DynamicActivationInt4WeightConfig(group_size=32)\nqat_config = QATConfig(base_config, step=\"prepare\")\nquantize_(m, qat_config)\n\n# Training (not shown)\n\n# Convert\nquantize_(m, QATConfig(base_config, step=\"convert\"))\n```\n\nFor more advanced use cases, users can still specify concrete FakeQuantizeConfigs as before:\n\n```py\n# Prepare\nactivation_config = IntxFakeQuantizeConfig(torch.int8, \"per_token\", is_symmetric=False)\nweight_config = IntxFakeQuantizeConfig(torch.int4, group_size=32)\nqat_config = QATConfig(\n    activation_config=activation_config,\n    weight_config=weight_config,\n    step=\"prepare\",\n)\nquantize_(model, qat_config)\n\n# Training and convert (not shown)\n```\n
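A minimal end-to-end sketch (our illustration, not from the release notes) of the one-config flow above; the toy model, data, and training loop are placeholders:\n\n```py\nimport torch\nfrom torchao.quantization import quantize_, Int8DynamicActivationInt4WeightConfig\nfrom torchao.quantization.qat import QATConfig\n\nm = torch.nn.Sequential(torch.nn.Linear(64, 64))\nbase_config = Int8DynamicActivationInt4WeightConfig(group_size=32)\n\n# Prepare: insert fake-quantize ops into the model\nquantize_(m, QATConfig(base_config, step=\"prepare\"))\n\n# Short fine-tuning loop; gradients flow through the fake-quantized forward\nopt = torch.optim.SGD(m.parameters(), lr=1e-3)\nfor _ in range(10):\n    loss = m(torch.randn(8, 64)).pow(2).mean()\n    loss.backward()\n    opt.step()\n    opt.zero_grad()\n\n# Convert: swap the fake-quantize ops for real quantized ops\nquantize_(m, QATConfig(base_config, step=\"convert\"))\n```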
\n### **(Prototype) NVFP4 and FP8 QAT** ([https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F2735](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F2735), [https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F2666](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F2666))\n\nWe extended QAT support to FP8 and NVFP4 use cases. You can try it as follows:\n\n```py\nfrom torchao.quantization import (\n    quantize_,\n    Float8DynamicActivationInt4WeightConfig,\n    Float8DynamicActivationFloat8WeightConfig,\n    Float8WeightOnlyConfig,\n)\nfrom torchao.prototype.mx_formats import NVFP4InferenceConfig\nfrom torchao.quantization.qat import QATConfig\n\n# Pick a base config\nbase_config = Float8DynamicActivationInt4WeightConfig()  # or\nbase_config = Float8DynamicActivationFloat8WeightConfig()  # or\nbase_config = NVFP4InferenceConfig()\n\n# Prepare\nqat_config = QATConfig(base_config, step=\"prepare\")\nquantize_(m, qat_config)\n\n# Training (not shown)\n\n# Convert\nquantize_(m, QATConfig(base_config, step=\"convert\"))\n```\n\nUsers can also use more specific FakeQuantizeConfigs for more complex scenarios, for example:\n\n```py\nfrom torchao.quantization import PerRow\nfrom torchao.quantization.qat import Float8FakeQuantizeConfig\nfrom torchao.prototype.qat import NVFP4FakeQuantizeConfig\n\nact_config = Float8FakeQuantizeConfig(torch.float8_e4m3fn, PerRow())\nweight_config = NVFP4FakeQuantizeConfig(use_per_tensor_scale=True)\n\n# Prepare\nqat_config = QATConfig(\n    activation_config=act_config,\n    weight_config=weight_config,\n    step=\"prepare\",\n)\nquantize_(model, qat_config)\n\n# Training and convert (not shown)","2025-09-02T17:57:07",{"id":201,"version":202,"summary_zh":203,"released_at":204},323927,"v0.12.0","## Highlights\n\nWe are excited to announce the torchao 0.12.0 release! This release adds QAT + Axolotl integration and prototype MXFP\u002FNVFP support on Blackwell GPUs!\n\n### QAT + Axolotl integration\n\nTorchAO's QAT support is now integrated into Axolotl's fine-tuning recipes! See the documentation [here](https:\u002F\u002Fdocs.axolotl.ai\u002Fdocs\u002Fqat.html), or run it yourself with:\n\n```shell\naxolotl train examples\u002Fllama-3\u002F3b-qat-fsdp2.yaml\naxolotl quantize examples\u002Fllama-3\u002F3b-qat-fsdp2.yaml\n```\n\nInitial Llama3.2-3B results from @SalmanMohammadi ([PR 2590](https:\u002F\u002Fgithub.com\u002Faxolotl-ai-cloud\u002Faxolotl\u002Fpull\u002F2590)):\n| Model\u002Fmetric | hellaswag acc | hellaswag acc_norm | wikitext bits per byte | wikitext byte perplexity | wikitext word perplexity |\n|-----------|------------------|---------------------|----------------------|-----------------------|-----------------------|\n| bfloat16  | 0.5552           | 0.7315              | 0.6410               | 1.5594                | 10.7591               |\n| bfloat16 PTQ | 0.5393          | 0.7157              | 0.6613               | 1.5815                | 11.6033               |\n| qat ptq   | 0.5423           | 0.7180              | 0.6567               | 1.5764                | 11.4043               |\n| recovered (qat ptq) | 18.87%         | 14.56%              | 22.66%               | 23.08%                | 23.57%                |\n\n\n### \\[Prototype | API not yet finalized\\] MXFP and NVFP support on Blackwell GPUs\n\nTorchAO now includes prototype support for NVFP4 (NVIDIA's 4-bit floating-point format) and microscaling (MX) formats on NVIDIA's latest Blackwell GPU architecture. These formats enable efficient inference, with up to 61% end-to-end performance gains for Qwen3 models in vLLM and nearly 2x speedups for diffusion workloads.\n\nUsage:\n\n```py\nfrom torchao.quantization import quantize_\nfrom torchao.prototype.mx_formats import (\n    MXFPInferenceConfig,\n    NVFP4InferenceConfig,\n)\n# Quantize the model with MXFP8\nquantize_(model, MXFPInferenceConfig(block_size=32))\n# Quantize the model to NVFP4 (without double scaling)\nquantize_(model, NVFP4InferenceConfig())\n```\n\n**Note**: this is a prototype feature and the API may change. It requires an NVIDIA Blackwell GPU (B200, 5090) with CUDA 12.8 or later.\n
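A small illustrative guard (our sketch, not a torchao API) for gating these prototype paths on hardware support; compute capability 10.0+ corresponds to Blackwell-class devices:\n\n```py\nimport torch\nfrom torchao.quantization import quantize_\nfrom torchao.prototype.mx_formats import NVFP4InferenceConfig\n\nmodel = torch.nn.Sequential(torch.nn.Linear(256, 256)).cuda().to(torch.bfloat16)\n\nmajor, _minor = torch.cuda.get_device_capability()\nif major >= 10:  # Blackwell-class GPU (e.g. B200, 5090)\n    quantize_(model, NVFP4InferenceConfig())\nelse:\n    print(\"NVFP4 prototype path skipped: requires a Blackwell GPU and CUDA 12.8+\")\n```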
\n## BC-breaking changes\n\n* Remove preserve_zero and zero_point_domain from choose_qparams_affine ([PR 2149](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F2149))\n* Rename the tinygemm quantization parameters ([PR 2344](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F2344))\n* Make the quant_primitives methods private ([PR 2350](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F2350))\n* Delete GaLore ([PR 2397](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F2397))\n* Remove the remaining GaLore content ([PR 2417](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F2417))\n* Remove `sparsity\u002Fprototype\u002Fblocksparse` ([PR 2205](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F2205))\n\n## Deprecations\n\n* Clean up the prototype folder ([PR","2025-07-17T17:56:01",{"id":206,"version":207,"summary_zh":208,"released_at":209},323928,"v0.11.0","## Highlights\n\nWe are excited to announce the torchao 0.11.0 release! This release adds support for mixture-of-experts (MoE) quantization, PyTorch 2 Export quantization (PT2E), and a microbenchmarking framework for the inference APIs!\n\n### MoE quantization\n\nWe implemented a prototype feature for quantizing MoE modules with a number of TorchAO quantization techniques. The approach reuses TorchAO's existing quantization of linear ops so that it can be applied to MoE modules.\n\n```py\nfrom torchao.quantization.prototype.moe_quant.utils import cond_ffn_filter, MoEQuantConfig\nfrom torchao.quantization.quant_api import quantize_, Int8WeightOnlyConfig\n\nquantize_(\n    model, \n    MoEQuantConfig(Int8WeightOnlyConfig()),   \n    filter_fn=cond_ffn_filter\n)\nmodel = torch.compile(\n    model, \n    mode=\"reduce-overhead\", \n    fullgraph=is_single_token_inference\n)\n```\n\nWhile the API above is enough to quantize an MoE module, it requires the module to be both quantizable and compilable. In practice, given how varied MoE implementations are, few user models satisfy both. So you first swap the vanilla MoE module for a `MoEFeedForwardAOQuantizable` module to prepare the model for quantization. `llama4_quant.py` contains an example showing how to apply this technique to Hugging Face's llama-4-Scout-17B-16E-Instruct model.\n\nWe implemented two MoE quantization approaches. The first (called `base` in the benchmarks below) directly extends the existing quantized tensor subclasses to quantize the 3D MoE expert tensors and perform the necessary indexing and slicing; the second (`fake`) uses a new tensor subclass that simulates 3D quantized parameters by storing a sequence of 2D slices of them. The first approach is faster but slightly worse on memory. Either way, this style of MoE quantization is unlikely to match the performance of a fused MoE kernel implemented per technique, but it still provides some speedup and significant memory savings.\n\nThe benchmarks below were run with the Mixtral-MoE model on a single H100 GPU:\n\n|             | batch size 1 |             | batch size 8 |              |             |  \n|-------------|-------------|-------------|-------------|--------------|-------------|  \n| Technique   | tokens\u002Fs  | memory (GB) | tokens\u002Fs  | tokens\u002Fs × batch | memory (GB) |  \n| no quantization |       78.35 |       93.76 |        18.2 |       145.64 |       94.12 |  \n| int8wo-base |        98.4 |       48.87 |        4.94 |        39.56 |        49.2 |  \n| int4wo-base |       79.38 |       36.15 |       10.29 |        82.29 |       36.12 |  \n| fp8wo-base  |       59.41 |       52.07 |        2.98 |        23.81 |       52.05 |  \n| fp8dq-base  |       45.92 |       53.97 |        3.78 |        30.23 |       53.94 |  \n| int8wo-fake |        6.14 |      ","2025-05-09T22:08:17",{"id":211,"version":212,"summary_zh":213,"released_at":214},323929,"v0.10.0","## Highlights\n\nWe are excited to announce the torchao 0.10.0 release! This release adds end-to-end mxfp8 training support on NVIDIA B200, PARQ (for quantization-aware training), a module-swap quantization API for research, and several low-bit kernel updates!\n\n### Low-bit optimizers moved to official support ([https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F1864](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F1864))\n\nThe [low-bit optimizers](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Freleases\u002Ftag\u002Fv0.4.0) (introduced in 0.4) have moved out of prototype and are now an officially supported torchao feature.\n
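As a quick illustration (ours, not from the release notes), the low-bit optimizers are drop-in replacements for their torch.optim counterparts:\n\n```py\nimport torch\nfrom torchao.optim import AdamW8bit\n\nmodel = torch.nn.Linear(512, 512).cuda()\n# same construction and step()\u002Fzero_grad() loop as torch.optim.AdamW\noptim = AdamW8bit(model.parameters(), lr=1e-4)\nloss = model(torch.randn(8, 512, device=\"cuda\")).pow(2).mean()\nloss.backward()\noptim.step()\noptim.zero_grad()\n```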
\n### \\[Prototype\\] End-to-end mxfp8 training support on NVIDIA B200 (\\#1786, \\#1841, \\#1951, \\#1932, \\#1980)\n\nWe implemented an early version of end-to-end training with the [mxfp8](https:\u002F\u002Fwww.opencompute.org\u002Fdocuments\u002Focp-microscaling-formats-mx-v1-0-spec-final-pdf) dtype on NVIDIA B200 using torch.compile. The cuBLAS mxfp8 gemm shows an observed speedup of more than **2x** over the bfloat16 gemm, and the cast from bfloat16 to mxfp8 runs at up to **5.5 TB per second**. For more information, see our [MX README.md](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fblob\u002Fmain\u002Ftorchao\u002Fprototype\u002Fmx_formats\u002FREADME.md). We plan to improve performance further in future releases.\n\n### \\[Prototype\\] Piecewise-Affine Regularized Quantization ([https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F1738](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F1738))\n\n* [PARQ](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fparq) is a new theoretical framework for quantization via regularization. It supports standard QAT while also offering new progressive quantization methods, with an easy-to-use optimizer-only interface. No changes to the model's forward or backward pass are needed during quantization.\n\n```py\nfrom torchao.prototype.parq.optim import QuantOptimizer, ProxHardQuant\nfrom torchao.prototype.parq.quant import UnifQuantizer\n\n# Separate the quantizable and non-quantizable parameter groups\nparam_groups = [\n    {\"params\": weights, \"quant_bits\": 2},  # add a quant_bits key for QAT\n    {\"params\": others},\n]\n\n# Initialize any torch.optim.Optimizer\nbase_optimizer = torch.optim.SGD(param_groups, lr=0.1, momentum=0.9, weight_decay=1e-4)\n\n# Apply a thin wrapper that performs quantization inside optimizer.step()\noptimizer = QuantOptimizer(\n    base_optimizer, quantizer=UnifQuantizer(), prox_map=ProxHardQuant()\n)\n```\n\n### \\[Prototype\\] Module-swap quantization API ([https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F1886](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F1886))\n\nWe added a prototype API for post-training quantization. Users can swap their linear or embedding layers for the corresponding QuantizedLinear and QuantizedEmbedding layers, and set quantizers that specify how input activations or weights are quantized:\n\n```py\nquantized_linear = QuantizedLinear(...)\nquantized_linear.weight_quantization = IntQuantizer(\n    num_bits=4,\n    group_size=32,\n    dynamic=True,","2025-04-07T19:57:16",{"id":216,"version":217,"summary_zh":218,"released_at":219},323930,"v0.9.0","# Highlights\n\nWe are excited to announce the torchao 0.9.0 release! This release moves several sparsity techniques out of prototype, significantly refactors the quantize_ API, adds new Cutlass kernels for 4-bit dynamic quantization, and more!\n\n### Block sparsity moved out of prototype and enabled  \nWe moved block sparsity out of torchao.prototype and made several performance optimizations. You can accelerate your model with block sparsity as follows:\n\n```python\nfrom torchao.sparsity import sparsify_, block_sparse_weight\nsparsify_(model, block_sparse_weight(blocksize=64))\n```\n\n##### Block sparsity benchmarks\n\n| **Technique**           | **Decode speed (tok\u002Fs)** | **Model size (GB)** | \n|-------------------------|----------------------|-------------------|\n| baseline                | 134.40               | 15.01             |\n| 2:4 sparsity            | 163.13               | 10.08             |\n| bsr-0.8-32              | 210.91               | 6.01              |\n| bsr-0.8-64              | 222.43               | 6.00              |\n| bsr-0.9-32              | 255.19               | 4.88              |\n| bsr-0.9-64              | 262.94               | 4.88              |\n| 2:4 sparsity + int4wo (Marlin)| 255.21               | 3.89              |\n\nThe block sparsity technique names (bsr) indicate the sparsity level and block size.\n\nThese numbers were generated on an H100 with torchao\u002F_models\u002Fllama\u002Fgenerate.py on the Meta-Llama-3.1-8B model. You can reproduce them with this [script](https:\u002F\u002Fgist.github.com\u002FHDCharles\u002F6e782c33d5aac24b36fa81d9e3bd5f5c).\n
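The last table row combines 2:4 sparsity with int4 weight-only quantization via the Marlin kernel. A hedged sketch of that combination, assuming the int4_weight_only + MarlinSparseLayout spelling used elsewhere in torchao around this release (verify against your installed version):\n\n```python\nfrom torchao.quantization import quantize_, int4_weight_only\nfrom torchao.dtypes import MarlinSparseLayout\n\nmodel = ...  # a model whose weights are already pruned to a 2:4 pattern\nquantize_(model, int4_weight_only(layout=MarlinSparseLayout()))\n```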
\n## BC-breaking changes\n\n### TorchAO M1 binaries are currently broken  \n\nWe found that the binaries for M1 devices have been broken since v0.8.0; in v0.7.0 they worked fine. We are working on a fix; see [here](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fissues\u002F1796) for details and discussion.\n\n#### quantize_ configuration: from callables to config objects (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F1595, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F1694, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F1696, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F1697)\n\nWe are migrating the configuration of the `quantize_` workflow from callables (tensor-subclass inserters) to plain config objects. This aligns with the rest of the ecosystem, supports inspecting a config after instantiation, and removes a common source of confusion.\n\n**The changes are as follows:**\n\nHere is how the signature of `quantize_`'s second argument changes:\n\n```python\n#\n# torchao v0.8.0 and earlier\n#\ndef quantize_(\n    model: torch.nn.Module,\n    apply_tensor_subclass: Callable[[torch.nn.Module], torch.nn.Module],\n    ...,\n): ...\n\n#\n# torchao v0.9.0\n#\ndef quantize_(\n    model: torch.nn.Module,\n    config: Union[AOBaseConfig, Callable[[torch.nn.Module], torch.nn.Module]],\n    ...,\n): ...\n\n#\n# torchao v0.10.0 or later (exact version TBD)\n#\ndef quantize_(\n    model: torch.nn.Module,\n    ","2025-02-28T14:23:55",{"id":221,"version":222,"summary_zh":223,"released_at":224},323931,"v0.8.0","# Highlights\n\nWe are excited to announce the torchao 0.8.0 release! In this release we brought CUTLASS kernels into torchAO for the first time, adding support for a W4A8 linear operator. We also added TTFT benchmarks to torchAO and compared the prefill and decode speedups of different quantization and sparsity combinations.\n\n## CUTLASS-based W4A8 implementation\n\nWe implemented a new W4A8 linear operator corresponding to the int8\\_dynamic\\_activation\\_int4\\_weight quantization scheme, which packs two 4-bit weight values into one 8-bit integer. We also added CUTLASS as a submodule of the torchao repository so we can better leverage it for more new kernels.\n\n### Benchmarks on A100  \n| `-q option` | mean tokens\u002Fsec | mean bandwidth (GB\u002Fs) | peak memory (GB) | model size (GB) |\n| :--- | ---: | ---: | ---: | ---: |\n| | 95.24 | 258.55 | 13.90 | 13.21 |\n| `-q int8wo` | 155.31 | 1028.37 | 8.97 | 6.62 |\n| `-q int4wo-32` | 186.70 | 774.98 | 5.31 | 4.15 |\n| `-q int4wo-hqq` | 186.47 | 774.01 | 5.04 | 4.15 |\n| `-q int8dq` | 49.64 | 328.72 | 9.44 | 6.62 |\n| `-q w4a8-cutlass` (tuned) | 119.31 | 394.86 | 4.52 | 3.31 |\n
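A hedged usage sketch for the W4A8 path (our illustration; the layout class name is our assumption for this release and may differ, so check torchao.dtypes in your installed version):\n\n```python\nfrom torchao.quantization import quantize_, int8_dynamic_activation_int4_weight\nfrom torchao.dtypes import CutlassInt4PackedLayout\n\nmodel = ...  # your bfloat16 model\n# routes int8 dynamic activation + int4 weight linears through the CUTLASS W4A8 kernel\nquantize_(model, int8_dynamic_activation_int4_weight(layout=CutlassInt4PackedLayout()))\n```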
\n## Prefill performance benchmarks\n\nWe added TTFT [benchmarks](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F1140) to torchAO and compared the prefill and decode speedups of different quantization and sparsity combinations. Prefill is mostly compute-bound, so dynamic quantization brings larger speedups there than weight-only quantization, while weight-only quantization is faster for decode. We also added an [option](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F1436) to selectively enable int8 dynamic quantization for prefill during LLM decoding.\n\n![Screenshot 2025-01-15 at 10:06:09 AM](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F06a029db-db48-4053-9c7b-9e6a47d9361f)\n\n## BC-breaking changes\n\n### Remove all-gather-only float8 from float8 training ([https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F1451](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F1451))\n\n`use_fp8_all_gather_only` was an experimental flag, off by default, which as far as we know was never advertised or used by anyone. To simplify the code, we decided to remove it.\n\n**Before**\n\n```python\nconfig = Float8LinearConfig(\n...,\n# the option below is being removed\nuse_fp8_all_gather_only = True,  \n)  \nconvert_to_float8_training(model, config=config, ...)\n```\n\n**After**\n\nThe `use_fp8_all_gather_only` option is no longer supported.\n\n## New features\n\n* Add TTFT benchmarks and update the sparsity benchmarks ([https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F1140](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F1140))   \n* Integrate Gemlite into torchao ([https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F1034](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F1034))   \n* CUTLASS-based W4A8 implementation ([https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F880](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F880)) \n\n## Improvements\n\n### quantize_\n\n* Expose zero_point_domain as an argument ([https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F14","2025-01-15T18:25:49",{"id":226,"version":227,"summary_zh":228,"released_at":229},323932,"v0.7.0-rc3","# Highlights\r\n\r\nWe are excited to announce the 0.7.0 release of torchao! This release moves QAT out of prototype with improved LoRA support and more flexible APIs, and adds support for new experimental kernels such as Marlin QQQ (for CUDA), `int8_dynamic_activation_intx_weight` (for ARM CPU), and more!\r\n\r\n## QAT moved out of prototype, LoRA integration, new flexible APIs (#1020, #1085, #1152, #1037, #1152)\r\n\r\nQAT has been moved out of prototype to `torchao\u002Fquantization\u002Fqat` to provide better API stability guarantees moving forward. In addition to the existing `*QATQuantizer` classes, we now also support the more flexible `FakeQuantizedLinear` and `FakeQuantizedEmbedding` modules for users to configure the exact quantization settings they wish to use during QAT.\r\n\r\n```python\r\nfrom torchao.quantization.qat.api import FakeQuantizeConfig\r\nfrom torchao.quantization.qat.embedding import FakeQuantizedEmbedding\r\nfrom torchao.quantization.qat.linear import FakeQuantizedLinear\r\n\r\n# Specify quantization schemes to use during QAT\r\nactivation_config = FakeQuantizeConfig(torch.int8, \"per_token\", is_symmetric=False)\r\nweight_config = FakeQuantizeConfig(torch.int4, group_size=8)\r\n\r\n# Replace nn.Linear and nn.Embedding with these in your model\r\nfq_linear = FakeQuantizedLinear(16, 32, False, activation_config, weight_config)\r\nfq_embedding = FakeQuantizedEmbedding(16, 32, weight_config=weight_config)\r\n```\r\n\r\nWe also leveraged the new flexible APIs to build a new QAT + LoRA fine-tuning flow in torchtune. Try it out today!\r\n\r\n```bash\r\ntune run --nnodes 1 --nproc_per_node 4 qat_lora_finetune_distributed --config llama3\u002F8B_qat_lora\r\n```\r\n
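A minimal training sketch (ours, not from the release notes) using the flexible modules above; the toy loop and shapes are placeholders:\r\n\r\n```python\r\nimport torch\r\nfrom torchao.quantization.qat.api import FakeQuantizeConfig\r\nfrom torchao.quantization.qat.linear import FakeQuantizedLinear\r\n\r\nactivation_config = FakeQuantizeConfig(torch.int8, \"per_token\", is_symmetric=False)\r\nweight_config = FakeQuantizeConfig(torch.int4, group_size=8)\r\n\r\nfq_linear = FakeQuantizedLinear(16, 32, False, activation_config, weight_config)\r\nopt = torch.optim.SGD(fq_linear.parameters(), lr=1e-3)\r\nfor _ in range(10):\r\n    loss = fq_linear(torch.randn(4, 16)).pow(2).mean()  # fake-quantized forward\r\n    loss.backward()\r\n    opt.step()\r\n    opt.zero_grad()\r\n```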
## Marlin QQQ for CUDA (#1113)\r\n\r\nMarlin QQQ is an optimized GPU kernel that supports W4A8 mixed-precision GEMM. For more details about Marlin QQQ, please refer to the [paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.09904).\r\n\r\n```python\r\nfrom torchao.dtypes import MarlinQQQLayout\r\nquantize_(\r\n    model,\r\n    int8_dynamic_activation_int4_weight(\r\n        group_size=128,\r\n        mapping_type=MappingType.SYMMETRIC,\r\n        act_mapping_type=MappingType.SYMMETRIC,\r\n        layout=MarlinQQQLayout(),\r\n    ),\r\n)\r\n```\r\n\r\nBenchmarking results can be found in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fblob\u002Fmain\u002Ftorchao\u002Fquantization\u002FREADME.md#marlin-qqq. \r\n\r\nThis is a prototype feature - feel free to try it out!\r\n\r\n## int8_dynamic_activation_intx_weight Quantization for ARM CPU (#995, #1027, #1254, #1353)\r\n\r\nWe have kernels that do 8-bit dynamic quantization of activations and uintx groupwise quantization of weights. These kernels are experimental and can only be run on a device with an ARM CPU (e.g., a Mac with Apple silicon).\r\n\r\n```python\r\nfrom torchao.experimental.quant_api import int8_dynamic_activation_intx_weight\r\nassert precision == torch.float32, \"int8_dynamic_activation_intx_weight requires fp32 precision\"\r\n\r\n# Build kernels in a temp location, and load them in torch\r\n# This requires an ARM CPU\r\nfrom torchao.experimental.temp_build import temp_build_and_load_torchao_ops\r\ntemp_build_and_load_torchao_ops(cmake_lists_path=os.path.dirname(os.path.realpath(__file__)) + \"\u002F..\u002F..\u002Fexperimental\")\r\n\r\n# Quantize the model\r\nnbit = 4\r\nassert nbit >= 1 and nbit \u003C= 8, \"nbit must be 1 to 8\"\r\ngroup_size = 128\r\nhas_weight_zeros = False\r\nquantize_(\r\n    model,\r\n    int8_dynamic_activation_intx_weight(\r\n        group_size=group_size,\r\n        nbit=nbit,\r\n        has_weight_zeros=has_weight_zeros,\r\n    ),\r\n)\r\n```\r\n\r\nBenchmarking results can be found in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fblob\u002Fmain\u002Ftorchao\u002Fquantization\u002FREADME.md#int8_dynamic_activation_intx_weight-quantization \r\n\r\nWe are still trying to figure out how to ship the ARM CPU kernels, so the exact API is subject to change.\r\n\r\n
## BC Breaking\r\n\r\n### Rename AQT LayoutType -> Layout (#[1049](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F1049))\r\n\r\nBefore:\r\n\r\n```\r\nfrom torchao.dtypes import (\r\n    BlockSparseLayoutType,\r\n    Int4CPULayoutType,\r\n    MarlinQQQLayoutType,\r\n    MarlinSparseLayoutType,\r\n    SemiSparseLayoutType,\r\n    TensorCoreTiledLayoutType,\r\n    UintxLayoutType,\r\n    Float8LayoutType,\r\n    LayoutType,\r\n    PlainLayoutType,\r\n)\r\n```\r\n\r\nAfter:\r\n\r\n```\r\nfrom torchao.dtypes import (\r\n    BlockSparseLayout,\r\n    Int4CPULayout,\r\n    MarlinQQQLayout,\r\n    MarlinSparseLayout,\r\n    SemiSparseLayout,\r\n    TensorCoreTiledLayout,\r\n    UintxLayout,\r\n    Float8Layout,\r\n    Layout,\r\n    PlainLayout,\r\n)\r\n```\r\n\r\n### QAT imports after move out of prototype (#1091)\r\n\r\nBefore:\r\n\r\n```python\r\nfrom torchao.quantization.prototype.qat import (\r\n    disable_4w_fake_quant,\r\n    disable_8da4w_fake_quant,\r\n    enable_4w_fake_quant,\r\n    enable_8da4w_fake_quant,\r\n    ComposableQATQuantizer,\r\n    Int4WeightOnlyQATQuantizer,\r\n    Int4WeightOnlyEmbeddingQATQuantizer,\r\n    Int8DynActInt4WeightQATQuantizer,\r\n    Int8DynActInt4WeightQATLinear,\r\n)\r\nfrom torchao.quantization.prototype.qat.api import (\r\n    FakeQuantizeConfig,\r\n)\r\nfrom torchao.quantization.prototype.qat.fake_quantizer import (\r\n    FakeQuantizer,\r\n)\r\n```\r\n\r\nAfter:\r\n\r\n","2024-12-06T22:13:17",{"id":231,"version":232,"summary_zh":233,"released_at":234},323933,"v0.6.1","## Highlights\r\n\r\nWe are excited to announce the 0.6.1 release of torchao! This release adds support for Auto-Round, float8 axiswise scaled training, a BitNet training recipe, an implementation of AWQ, and much more!\r\n\r\n### Auto-Round Support (#581)\r\nAuto-Round is a new weight-only quantization algorithm that has achieved superior accuracy compared to [GPTQ](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.17323), [AWQ](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.00978), and [OmniQuant](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.13137) across 11 tasks, particularly excelling in low-bit quantization (e.g., 2-bits and 3-bits). Auto-Round supports quantization from 2 to 8 bits, involves low tuning costs, and imposes no additional overhead during inference. Key results are summarized below, with detailed information available in our [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.05516), [GitHub repository](https:\u002F\u002Fgithub.com\u002Fintel\u002Fauto-round\u002Fblob\u002Fmain\u002Fdocs\u002Facc.md?rgh-link-date=2024-07-22T01%3A42%3A54Z), and the Hugging Face [low-bit quantization leaderboard](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FIntel\u002Flow_bit_open_llm_leaderboard).\r\n\r\n``` Python\r\nfrom torchao.prototype.autoround.core import prepare_model_for_applying_auto_round_\r\nfrom torchao.prototype.autoround.core import apply_auto_round\r\n\r\nprepare_model_for_applying_auto_round_(\r\n    model,\r\n    is_target_module=is_target_module,\r\n    bits=4,\r\n    group_size=128,\r\n    iters=200,\r\n    device=device,\r\n)\r\n\r\ninput_ids_lst = []\r\nfor data in dataloader:\r\n    input_ids_lst.append(data[\"input_ids\"].to(model_device))\r\n\r\nmulti_t_input_ids = MultiTensor(input_ids_lst)\r\nout = model(multi_t_input_ids)\r\n\r\nquantize_(model, apply_auto_round(), is_target_module)\r\n```\r\n### Added float8 training axiswise scaling support with per-gemm-argument configuration (#940)\r\n\r\nWe added experimental support for rowwise scaled float8 gemm to `torchao.float8`, with per-gemm-input configurability to enable exploration of various recipes. Here is how a user can configure all-axiswise scaling:\r\n\r\n```python\r\n# all-axiswise scaling\r\nconfig = torchao.float8.config.recipe_name_to_linear_config(Float8LinearRecipeName.ALL_AXISWISE)\r\nm = torchao.float8.convert_to_float8_training(m, config=config)\r\n\r\n# or, a custom recipe by @lw where grad_weight is left in bfloat16\r\nconfig = torchao.float8.config.recipe_name_to_linear_config(Float8LinearRecipeName.LW_AXISWISE_WITH_GW_HP)\r\nm = torchao.float8.convert_to_float8_training(m, config=config)\r\n```\r\n\r\nEarly performance benchmarks show all-axiswise scaling achieves a 1.13x speedup vs bf16 on torchtitan \u002F LLaMa 3 8B \u002F 8 H100 GPUs (compared to 1.17x from all-tensorwise scaling in the same setup), and loss curves which match bf16 and all-tensorwise scaling. Further performance and accuracy benchmarks will follow in future releases.\r\n\r\n
### Introduced BitNet b1.58 training recipe (#930)\r\nAdds a recipe for [BitNet b1.58](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.17764) ternary weights clamping. \r\n``` Python\r\nfrom torchao.prototype.quantized_training import bitnet_training\r\nfrom torchao import quantize_\r\n\r\nmodel = ...\r\nquantize_(model, bitnet_training())\r\n```\r\nNotably, our implementation utilizes INT8 Tensor Cores to make up for the speed lost to ternary weight clamping. In fact, our implementation is faster than BF16 training in most cases.\r\n\r\n### [Prototype] Implemented Activation Aware Weight Quantization [AWQ](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.00978) (#743)\r\nPerplexity and performance measured on an A100 GPU:\r\n| Model              | Quantization | Tokens\u002Fsec | Throughput (GB\u002Fsec) | Peak Mem (GB) | Model Size (GB) |\r\n|--------------------|--------------|------------|---------------------|---------------|-----------------|\r\n| Llama-2-7b-chat-hf | bfloat16     | 107.38     | 1418.93             | 13.88         | 13.21           |\r\n|                    | awq-hqq-int4 | 196.6      | 761.2               | 5.05          | 3.87            |\r\n|                    | awq-uint4    | 43.59      | 194.93              | 7.31          | 4.47            |\r\n|                    | int4wo-hqq   | 209.19     | 804.32              | 4.89          | 3.84            |\r\n|                    | int4wo-64    | 201.14     | 751.42              | 4.87          | 3.74            |\r\n\r\nUsage:\r\n\r\n```Python\r\nfrom torchao.prototype.awq import insert_awq_observer_, awq_uintx, AWQObservedLinear\r\n\r\nquant_dtype = torch.uint4\r\ngroup_size = 64\r\ncalibration_limit = 10\r\ncalibration_seq_length = 1024\r\nmodel = model.to(device)\r\n\r\n# Insert observers to record activation statistics during calibration\r\ninsert_awq_observer_(model, calibration_limit, calibration_seq_length, quant_dtype=quant_dtype, group_size=group_size)\r\nwith torch.no_grad():\r\n    for batch in calibration_data:\r\n        model(batch.to(device))\r\n\r\n# Quantize only the observed linear layers\r\nis_observed_linear = lambda m, fqn: isinstance(m, AWQObservedLinear)\r\nquantize_(model, awq_uintx(quant_dtype=quant_dtype, group_size=group_size), is_observed_linear)\r\n```   \r\n
## New Features\r\n\r\n- [Prototype] Added Float8 support for AQT tensor parallel (#1003)\r\n- Added composable QAT quantizer (#938)\r\n- Introduced torchchat quantizer (#897)\r\n- Added INT8 mixed-precision training (#748)\r\n- Implemented sparse ","2024-10-21T21:45:23",{"id":236,"version":237,"summary_zh":238,"released_at":239},323934,"v0.5.0","## Highlights\r\nWe are excited to announce the 0.5 release of torchao! This release adds support for memory-efficient inference, float8 training and inference, int8 quantized training, HQQ, automatic mixed-precision quantization through Bayesian optimization, sparse marlin, and integrations with HuggingFace, SGLang, and diffusers.\r\n\r\n## Memory Efficient Inference Support https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F738\r\n\r\nWe've added support for Llama 3.1 to the llama benchmarks in TorchAO and added new features and improvements as a proof of concept for memory-efficient inference. These additions allow us to do *130k context length inference with Llama 3.1-8B with only 18.91 GB of memory* if we combine kv cache quantization, int4 weight-only quantization, and a linear causal mask.\r\n\r\nGeneral savings depend on technique and context length, as can be seen in the following graph:\r\n![image](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F6d0b732c-159d-48e0-a6c7-d4977971fb80)\r\n\r\n## Float8 Training https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F551\r\n\r\n[torchao.float8](torchao\u002Ffloat8) implements training recipes with the scaled float8 dtypes, as laid out in https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.05433.\r\n\r\nWith ``torch.compile`` on, current results show throughput speedups of up to **1.5x on 128 H100 GPU LLaMa 3 70B pretraining jobs** ([details](https:\u002F\u002Fdev-discuss.pytorch.org\u002Ft\u002Fenabling-float8-all-gather-in-fsdp2\u002F2359))\r\n\r\n```python\r\nfrom torchao.float8 import convert_to_float8_training\r\nconvert_to_float8_training(m, module_filter_fn=...)\r\n```\r\n\r\nAnd for a minimal end-to-end recipe of pretraining with float8, you can check out [torchtitan](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fblob\u002Fmain\u002Fdocs\u002Ffloat8.md).\r\n\r\n
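A short illustration (ours) of a `module_filter_fn`, which decides per module whether to convert it; the toy model and the skip rule are placeholders:\r\n\r\n```python\r\nimport torch\r\nfrom torchao.float8 import convert_to_float8_training\r\n\r\nmodel = torch.nn.Sequential(\r\n    torch.nn.Linear(1024, 4096),\r\n    torch.nn.ReLU(),\r\n    torch.nn.Linear(4096, 1024),\r\n)\r\n\r\n# skip the last linear; small or output-sensitive layers are often left in high precision\r\ndef module_filter_fn(module: torch.nn.Module, fqn: str) -> bool:\r\n    return isinstance(module, torch.nn.Linear) and fqn != \"2\"\r\n\r\nconvert_to_float8_training(model, module_filter_fn=module_filter_fn)\r\n```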
## Float8 Inference https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F740 https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F819\r\n\r\nWe have introduced two new quantization APIs for Float8 inference:\r\n\r\n1. **Float8 Weight-Only Quantization**: A new quant_api [float8_weight_only()](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fblob\u002F3f7fc14c88e30d4ff21ee3c4bb50f1aac4540409\u002Ftorchao\u002Fquantization\u002Fquant_api.py#L622) has been added to apply float8 weight-only symmetric per-channel quantization to linear layers.\r\n\r\n2. **Float8 Dynamic Activation and Weight Quantization**: A new quant_api [float8_dynamic_activation_float8_weight()](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fblob\u002F3f7fc14c88e30d4ff21ee3c4bb50f1aac4540409\u002Ftorchao\u002Fquantization\u002Fquant_api.py#L697) has been introduced to apply float8 dynamic symmetric quantization to both activations and weights of linear layers. PerTensor scaling is used by default. We have also added an option to do [PerRow](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fblob\u002F3f7fc14c88e30d4ff21ee3c4bb50f1aac4540409\u002Ftorchao\u002Fquantization\u002Fobserver.py#L57) scaling of both activations and weights. By computing scales at a finer granularity, PerRow scaling can potentially reduce the overall quantization error and increase performance by reducing the dynamic quantization overhead.\r\n\r\nExample usage:\r\n```python\r\nimport torch\r\nfrom torchao.quantization import quantize_, float8_weight_only, float8_dynamic_activation_float8_weight, PerRow\r\n\r\n# Create a model\r\nmodel = YourModel()\r\n\r\n# Apply float8 weight-only quantization\r\nquantize_(model, float8_weight_only())\r\n\r\n# Apply float8 dynamic activation and weight quantization\r\nquantize_(model, float8_dynamic_activation_float8_weight())\r\n\r\n# Apply PerRow scaling to weights and activations\r\nquantize_(linear_module, float8_dynamic_activation_float8_weight(granularity=PerRow())) \r\n```\r\n\r\nNotes:\r\n- These new APIs are designed to work with PyTorch 2.5 and later versions.\r\n- `float8_dynamic_activation_float8_weight` requires CUDA devices with compute capability 8.9 or higher for hardware acceleration.\r\n\r\n## Int8 quantized training #644 #748\r\n\r\n@gau-nernst introduced 2 experimental works on training using INT8.\r\n\r\n- **INT8 quantized training** (#644): the weight is quantized to INT8 for the whole duration of training to save memory. Compute remains in high precision. To train the model effectively with only quantized weights, we use stochastic rounding for the weight update. Right now, the memory saving is not too competitive compared to a compiled BF16 baseline.\r\n- **INT8 mixed-precision training** (#748): the weight is kept in the original high precision, but weights and activations are dynamically quantized to INT8 during training to utilize the INT8 tensor cores. We observe up to 70% speedup for Llama2 pre-training on a 4090, and 20% speedup for Llama3 pre-training on 8x A100 with FSDP2.\r\n\r\n```python\r\nfrom torchao.quantization import quantize_\r\nfrom torchao.prototype.quantized_training import int8_weight_only_quantized_training, int8_mixed_precision_training\r\n\r\nmodel = YourModel()\r\n\r\n# apply INT8 quantized training\r\nquantize_(model, int8_weight_only_quantized_training())\r\n\r\n# apply INT8 mixed-precision training\r\nquantize_(model, int8_mixed_precision_training())\r\n```\r\n\r\nFor more information and benchmark results, see the [README](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Ftree\u002Fv0.5.0-rc2\u002Ftorchao\u002Fprototype\u002Fquantized_training) and the re","2024-09-08T17:18:22",{"id":241,"version":242,"summary_zh":243,"released_at":244},323935,"v0.4.0","### v0.4.0\r\n\r\n## Highlights\r\n\r\nWe are excited to announce the 0.4 release of torchao! This release adds support for KV cache quantization, quantization-aware training (QAT), low-bit optimizer support, composing quantization and sparsity, and more!\r\n\r\n## KV cache quantization (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F532)\r\n\r\nWe've added support for KV cache quantization, showing a peak memory reduction from 19.7 -> 19.2 GB on Llama3-8B at an 8192 context length. We plan to investigate Llama3.1 next. 
\r\n\r\n\u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F31946f46-e8eb-45c2-ac1c-3a7d981c58a2\" width=\"300\" height=\"auto\">\r\n\r\n\r\n\r\n## Quantization-Aware Training (QAT) ([#383](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F383), [#555](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F555))  \r\n\r\nWe now support two QAT schemes for linear layers: Int8 per token dynamic activations + int4 per group weights, and int4 per group weights (using the efficient [tinygemm int4 kernel](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fblob\u002Fa672f6c84e318bbf455f13dfdd3fd7c68a388bf5\u002Faten\u002Fsrc\u002FATen\u002Fnative\u002Fcuda\u002Fint4mm.cu#L1097) after training). Users can access this feature by transforming their models before and after training using the appropriate quantizer, for example:\r\n\r\n\r\n```python\r\nfrom torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer\r\n\r\n# Quantizer for int8 dynamic per token activations +\r\n# int4 grouped per channel weights, only for linear layers\r\nqat_quantizer = Int8DynActInt4WeightQATQuantizer()\r\n\r\n# Insert \"fake quantize\" operations into linear layers.\r\n# These operations simulate quantization numerics during\r\n# training without performing any dtype casting\r\nmodel = qat_quantizer.prepare(model)\r\n\r\n# Convert fake quantize to actual quantize operations\r\nmodel = qat_quantizer.convert(model)\r\n```\r\n\r\nInitial evaluation results indicate that QAT in torchao can recover up to 96% of quantized accuracy degradation on hellaswag and up to 68% of quantized perplexity degradation on wikitext for Llama3 compared to post-training quantization (PTQ). For more details, please refer to the [README](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Ftree\u002Fmain\u002Ftorchao\u002Fquantization\u002Fprototype\u002Fqat) and [this blog post](https:\u002F\u002Fpytorch.org\u002Fblog\u002Fquantization-aware-training\u002F).\r\n\r\n## Composing quantization and sparsity (#457, #473)\r\n\r\nWe've added support for composing int8 dynamic quantization with 2:4 sparsity, using the `quantize_` API. We also added SAM benchmarks that show a 7% speedup over standalone sparsity \u002F int8 dynamic quantization [here](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Ftree\u002Fmain\u002Ftorchao\u002Fsparsity#segment-anything-fast). \r\n\r\n```python\r\nfrom torchao.quantization import quantize_, int8_dynamic_activation_int8_semi_sparse_weight\r\nquantize_(model, int8_dynamic_activation_int8_semi_sparse_weight())\r\n```\r\n\r\n## Community Contributions\r\n\r\n## low-bit optimizer support (#478, #463, #482, #484, #538)\r\n\r\n@gau-nernst added implementations for 4-bit, 8-bit, and FP8 Adam with FSDP2\u002FFSDP support. 
Our API is a drop-in replacement for `torch.optim.Adam` and can be used as follows:\r\n```python\r\nfrom torchao.prototype.low_bit_optim import Adam8bit, Adam4bit, AdamFp8\r\nfrom torchao.prototype.low_bit_optim import AdamW8bit, AdamW4bit, AdamWFp8\r\n\r\n\r\nmodel = ...\r\noptim = Adam8bit(model.parameters()) # replace with Adam4bit and AdamFp8 for the 4-bit \u002F fp8 versions\r\n```\r\n\r\nFor more information about low-bit optimizer support please refer to our [README](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Ftree\u002Fmain\u002Ftorchao\u002Fprototype\u002Flow_bit_optim).\r\n\r\n## Improvements to 4-bit quantization (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F517, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F552, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F544, #479 ) \r\n\r\n@bdhirsh @jeromeku @yanbing-j @manuelcandales @larryliu0820 added torch.compile support for NF4 Tensor, custom CUDA int4 tinygemm unpacking ops, and several bugfixes to torchao\r\n\r\n## BC breaking\r\n* `quantize` has been renamed to `quantize_` https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F467\r\n``` python\r\n# for torchao 0.4\r\nfrom torchao.quantization import quantize_, int8_weight_only\r\nquantize_(model, int8_weight_only())\r\n\r\n# for torchao 0.3\r\nfrom torchao.quantization import quantize, int8_weight_only\r\nquantize(model, int8_weight_only())\r\n```\r\n* `apply_sparse_semi_structured` has been deprecated in favor of `sparsify_`, which matches the `quantize_` API https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F473\r\n``` python\r\n# for torchao 0.4\r\nfrom torchao.sparsity import sparsify_, semi_sparse_weight\r\nsparsify_(model, semi_sparse_weight())\r\n\r\n# for torchao 0.3\r\nfrom torchao.sparsity import apply_sparse_semi_structured\r\napply_sparse_semi_structured(model)\r\n```\r\n\r\n## Deprecations\r\n\r\n\r\n## New Features\r\n* Added kv_cache quantization https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F532\r\n* Migrated float8_experimental to `torchao.float8`, enabling float8 training support https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F551 https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F529 \r\n* Added FP5 E2M2  https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F399\r\n* Added 4-bit, 8-bit, and FP8 ADAM support https:\u002F\u002Fgithub.com\u002Fpytorc","2024-08-07T16:48:51",{"id":246,"version":247,"summary_zh":248,"released_at":249},323936,"v0.3.0","### v0.3.1\r\n\r\n## Highlights\r\n\r\nWe are excited to announce the 0.3 release of torchao! This release adds support for a new quantize API, the MX format, the FP6 dtype and bitpacking, 2:4 sparse accelerated training, and benchmarking infra for llama2\u002Fllama3 models. \r\n\r\n\r\n### `quantize` API (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F256)\r\nWe added a tensor-subclass-based quantization API; see the [docs](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Ftree\u002Fmain\u002Ftorchao\u002Fquantization) and README for details on usage. This is planned to replace all existing quantization APIs in torchao for torch 2.4 and later.\r\n\r\n### Accelerated training with 2:4 sparsity (#184)  \r\nYou can now accelerate training with 2:4 sparsity, using the runtime pruning + compression kernels written by xFormers. These kernels process a 4x4 sub-tile to be 2:4 sparse in both directions, to handle both the forward and backward pass when training. We see a [1.3x speedup](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Ftree\u002Fmain\u002Ftorchao\u002Fsparsity\u002Ftraining#benchmarking) for the MLP layers of ViT-L across a forward and backwards pass.\r\n\r\n
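A hedged sketch (ours, not from the release notes) of enabling runtime 2:4 sparse training by swapping nn.Linear modules for semi-sparse ones; the module names follow torchao.sparsity.training as we understand it and may have changed since:\r\n\r\n```python\r\nimport torch\r\nfrom torchao.sparsity.training import (\r\n    SemiSparseLinear,\r\n    swap_linear_with_semi_sparse_linear,\r\n)\r\n\r\nmodel = torch.nn.Sequential(\r\n    torch.nn.Linear(1024, 4096),\r\n    torch.nn.GELU(),\r\n    torch.nn.Linear(4096, 1024),\r\n).cuda().half()\r\n\r\n# map fully-qualified module names to the sparse replacement class\r\nsparse_config = {\"0\": SemiSparseLinear, \"2\": SemiSparseLinear}\r\nswap_linear_with_semi_sparse_linear(model, sparse_config)\r\n```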
### MX support (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F264)\r\nWe added prototype support for the MX format for training and inference, with a reference native PyTorch implementation of training and inference primitives for MX-accelerated matrix multiplications. The MX numerical formats are new low-precision formats with recent acceptance into the OCP spec:\r\nhttps:\u002F\u002Fwww.opencompute.org\u002Fdocuments\u002Focp-microscaling-formats-mx-v1-0-spec-final-pdf\r\n\r\n### Benchmarking (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F276, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F374)\r\nWe added a stable way to benchmark llama2 and llama3 models that includes perf\u002Faccuracy comparisons. See torchao\u002F_models\u002Fllama\u002Fbenchmarks.sh for more details. \r\n\r\n## 🌟 💥   Community Contributions 🌟 💥 \r\n### FP6 support (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F279, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F283, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F358)\r\n@gau-nernst added support for the FP6 dtype and a mixed FP16 x FP6 matmul kernel with support for torch.compile. Benchmark results show a [2.3x speedup](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fissues\u002F208#issuecomment-2143240728) over the BF16 baseline for meta-llama\u002FLlama-2-7b-chat-hf \r\n\r\n### Bitpacking (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F307, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F282)\r\n@vayuda, @melvinebenezer, @CoffeeVampir3, and @andreaskoepf added support for packing\u002Funpacking lower-bit dtypes, leveraging torch.compile to generate the kernels, and added UInt2 and Bitnet tensors based on this approach.\r\n\r\n### FP8 split-gemm kernel https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F263\r\nAdded the kernel written by @AdnanHoque to torchao, with [speedups](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F263#issuecomment-2130284378) compared to the cuBLAS kernel for batch size \u003C=16  \r\n\r\n\r\n## BC Breaking\r\n\r\n## Deprecations\r\n* Deprecate top-level quantization APIs https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F344\r\n\r\n### 1. int8 weight only quantization\r\n`apply_weight_only_int8_quant(model)` or `change_linear_weights_to_int8_woqtensors(model)`\r\n\r\n-->\r\n\r\n```python\r\n# for torch 2.4+\r\nfrom torchao.quantization import quantize, int8_weight_only\r\nquantize(model, int8_weight_only())\r\n\r\n# for torch 2.2.2 and 2.3\r\nfrom torchao.quantization.quant_api import change_linear_weights_to_int8_woqtensors\r\nchange_linear_weights_to_int8_woqtensors(model)\r\n```\r\n\r\n### 2. 
int8 dynamic quantization\r\n`apply_dynamic_quant(model)` or `change_linear_weights_to_int8_dqtensors(model)`\r\n\r\n-->\r\n\r\n```python\r\n# Fuse the int8*int8 -> int32 matmul and the subsequent mul op, avoiding materialization of the int32 intermediary tensor\r\ntorch._inductor.config.force_fuse_int_mm_with_mul = True\r\n\r\n# for torch 2.4+\r\nfrom torchao.quantization import quantize, int8_dynamic_activation_int8_weight\r\nquantize(model, int8_dynamic_activation_int8_weight())\r\n\r\n# for torch 2.2.2 and 2.3\r\nfrom torchao.quantization.quant_api import change_linear_weights_to_int8_dqtensors\r\nchange_linear_weights_to_int8_dqtensors(model)\r\n```\r\n\r\n### 3. int4 weight only quantization\r\n`change_linear_weights_to_int4_woqtensors(model)`\r\n\r\n-->\r\n\r\n```python\r\n# for torch 2.4+\r\nfrom torchao.quantization import quantize, int4_weight_only\r\nquantize(model, int4_weight_only())\r\n\r\n# for torch 2.2.2 and 2.3\r\nfrom torchao.quantization.quant_api import change_linear_weights_to_int4_woqtensors\r\nchange_linear_weights_to_int4_woqtensors(model)\r\n```\r\n\r\n\r\n## New Features\r\n* Add `quantize` https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F256\r\n* Add a prototype of MX format training and inference  https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F264\r\n* [FP6-LLM] Port splitK map from DeepSpeed  https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F283\r\n* Improve FP6-LLM 2+4bit weight splitting + user API https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F279\r\n* Bitpacking  https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F291\r\n* Training acceleration via runtime semi-structured sparsity  https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F184\r\n* Bitpackingv2  https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F307\r\n* Add FP6-LLM doc and move FP6-LLM to prototype  https:\u002F\u002Fgithub.com\u002Fp","2024-06-26T20:36:50",{"id":251,"version":252,"summary_zh":253,"released_at":254},323937,"v0.2.0","## What's Changed\r\n\r\n## Highlights\r\n\r\n### Custom CPU\u002FCUDA extension to ship CPU\u002FCUDA binaries.\r\n\r\nPyTorch core has recently shipped a new custom op registration mechanism with [torch.library](https:\u002F\u002Fpytorch.org\u002Fdocs\u002Fstable\u002Flibrary.html); the benefit is that custom ops compose with as many PyTorch subsystems as possible, most notably NOT graph-breaking with `torch.compile()`.\r\n\r\nWe've added some documentation for how you can register your own custom ops https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Ftree\u002Fmain\u002Ftorchao\u002Fcsrc and, if you learn better via example, you can follow this PR https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F135 to add your own custom ops to `torchao`. \r\n\r\nMost notably, these instructions were leveraged by @gau-nernst to integrate some new custom ops for `fp6` support https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F223\r\n\r\nOne key benefit of integrating your kernels directly in `torchao` is that, thanks to our `manylinux` GPU support, we can ensure the CPU\u002FCUDA kernels you've added will work on as many devices and cuda versions as possible https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F176\r\n\r\n
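As a brief illustration (ours; the op namespace and implementation are placeholders), the torch.library registration pattern referenced above looks roughly like this:\r\n\r\n```python\r\nimport torch\r\nfrom torch.library import Library, impl\r\n\r\n# define a custom op schema in our own namespace\r\nlib = Library(\"myao\", \"DEF\")\r\nlib.define(\"scale(Tensor x, float s) -> Tensor\")\r\n\r\n@impl(lib, \"scale\", \"CompositeExplicitAutograd\")\r\ndef scale(x: torch.Tensor, s: float) -> torch.Tensor:\r\n    return x * s\r\n\r\n# registered ops are callable via torch.ops and compose with torch.compile\r\nout = torch.ops.myao.scale(torch.randn(4), 2.0)\r\n```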
### A lot of prototype and community contributions\r\n\r\n@jeromeku was our community champion, merging support for: \r\n1. [GaLore](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.03507), our first pretraining kernel, which allows you to finetune llama 7b on a single 4090 card with up to 70% speedups relative to eager PyTorch\r\n2. [DoRA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.09353), which has been shown to yield superior fine-tuning accuracy to QLoRA. This is an area where the community can help us benchmark more thoroughly https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Ftree\u002Fmain\u002Ftorchao\u002Fprototype\u002Fdora\r\n3. Fused int4\u002Ffp16 quantized matmul, which is particularly useful for compute-bound kernels, showing 4x speedups over tinygemm for larger batch sizes such as 512 https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Ftree\u002Fmain\u002Ftorchao\u002Fprototype\u002Fhqq\r\n\r\n@gau-nernst merged [fp6](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.14112) support, showing up to 8x speedups over an fp16 baseline for small-batch-size inference https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F223\r\n\r\n\r\n### NF4 support for upcoming [FSDP2](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fissues\u002F114299)\r\n\r\n@weifengpy merged support for composing FSDP2 with NF4, which makes it easy to implement algorithms like QLoRA + FSDP without writing any CUDA or C++ code. This work also provides a blueprint for how to compose smaller dtypes with FSDP https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F150, most notably by implementing `torch.chunk()`. We hope the broader community uses this work to experiment more heavily at the intersection of distributed and quantization research, and that it inspires many more studies such as the ones done by Answer.ai https:\u002F\u002Fwww.answer.ai\u002Fposts\u002F2024-03-06-fsdp-qlora.html\r\n\r\n
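A small hedged sketch of the NF4 tensor subclass this builds on (our illustration; `to_nf4`'s exact signature may differ across versions):\r\n\r\n```python\r\nimport torch\r\nfrom torchao.dtypes import to_nf4\r\n\r\nw = torch.randn(512, 512, dtype=torch.bfloat16)\r\nw_nf4 = to_nf4(w, block_size=64, scaler_block_size=256)\r\n\r\n# torch.chunk composes with the subclass, which is what FSDP relies on for sharding\r\nshards = torch.chunk(w_nf4, 2, dim=0)\r\nprint(w_nf4.get_original_weight().dtype)  # dequantize back for inspection\r\n```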
## BC breaking\r\n\r\n## Deprecations\r\n\r\n## New Features\r\n* Match autoquant API with torch.compile (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F109, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F162, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F175)\r\n* [Prototype] 8da4w QAT (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F138, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F199, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F198, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F211, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F154, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F157, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F229)\r\n* [Prototype] GaLore (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F95)\r\n* [Prototype] DoRA (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F216)\r\n* [Prototype] HQQ (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F153, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F185)\r\n* [Prototype] 2:4 sparse + int8 sparse subclass (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F36)\r\n* [Prototype] Unified quantization primitives (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F159, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F201, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F193, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F220, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F227, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F173, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F210)\r\n* [Prototype] Pruning primitives (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F148, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F194)\r\n* [Prototype] AffineQuantizedTensor subclass (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F214, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F230, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F243, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F247, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F251)\r\n* [Prototype] Add `Int4WeightOnlyQuantizer` (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F119)\r\n* Custom CUDA extensions (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F135, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F186, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F232)\r\n* [Prototype] Add FP6 Linear (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F223)\r\n\r\n\r\n## Improvements\r\n* [FSDP2](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fissues\u002F114299) support for NF4Tensor (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F118, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F150, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F207)\r\n* Add save\u002Fload of int8 weight only quantized model (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F122)\r\n* Add int_scaled_mm on CPU (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Fpull\u002F121)\r\n* Add cpu and gpu in int4wo and int4wo-gptq quantizer (https:","2024-05-20T20:52:15",{"id":256,"version":257,"summary_zh":258,"released_at":259},323938,"v0.1","# Highlights\r\nWe're excited to announce the release of TorchAO v0.1.0! TorchAO is a repository that hosts architecture optimization techniques such as quantization and sparsity, along with performance kernels on different backends such as CUDA and CPU. In this release, we added support for a few quantization techniques like int4 weight-only GPTQ quantization, added nf4 dtype support for QLoRA and sparsity features like WandaSparsifier, and added an autotuner that can tune Triton integer matrix multiplication kernels on CUDA.\r\n\r\nNote: TorchAO is currently in a pre-release state and under extensive development. The public APIs should not be considered stable. 
But we welcome you to try out our APIs and offerings and provide any feedback on your experience.\r\n\r\ntorchao 0.1.0 will be compatible with PyTorch 2.2.2 and 2.3.0, ExecuTorch 0.2.0 and TorchTune 0.1.0.\r\n\r\n# New Features\r\n## Quantization\r\n* Added tensor subclass based quantization APIs: `change_linear_weights_to_int8_dqtensors`, `change_linear_weights_to_int8_woqtensors` and `change_linear_weights_to_int4_woqtensors` (#1)\r\n* Added module based quantization APIs for int8 dynamic and weight only quantization `apply_weight_only_int8_quant` and `apply_dynamic_quant` (#1)\r\n* Added module swap version of int4 weight only quantization `Int4WeightOnlyQuantizer` and `Int4WeightOnlyGPTQQuantizer` used in TorchTune (#119, #116)\r\n* Added int8 dynamic activation and int4 weight quantization `Int8DynActInt4WeightQuantizer` and `Int8DynActInt4WeightGPTQQuantizer`, used in ExecuTorch (#74) (available after torch 2.3.0 and later)\r\n## Sparsity\r\n* Added `WandaSparsifier` that prunes both weights and activations (#22)\r\n## Kernels\r\n* Added `autotuner` for int mm Triton kernels (#41)\r\n## dtypes\r\n* `nf4` tensor subclass and `nf4` linear (#37, #40, #62)\r\n* Added `uint4` dtype tensor subclass (#13)\r\n\r\n# Improvements\r\n* Setup github workflow for regression testing (#50)\r\n* Setup github workflow for `torchao-nightly` release (#54)\r\n\r\n## Documentation\r\n* Added tutorials for quantizing vision transformer model (#60)\r\n* Added tutorials for how to add an op for `nf4` tensor (#54)\r\n\r\n## Notes\r\n* we are still debugging the accuracy problem for `Int8DynActInt4WeightGPTQQuantizer`\r\n* Save and load does not work well for tensor subclass based APIs yet\r\n* We will consolidate tensor subclass and module swap based quantization APIs later\r\n* `uint4` tensor subclass is going to be merged into pytorch core in the future\r\n* Quantization ops in `quant_primitives.py` will be deduplicated with similar quantize\u002Fdequantize ops in PyTorch later\r\n\r\n\r\n","2024-04-04T23:18:20"]