[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-pytorch--torchtitan":3,"tool-pytorch--torchtitan":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":75,"owner_avatar_url":76,"owner_bio":77,"owner_company":78,"owner_location":78,"owner_email":78,"owner_twitter":78,"owner_website":79,"owner_url":80,"languages":81,"stars":94,"forks":95,"last_commit_at":96,"license":97,"difficulty_score":98,"env_os":99,"env_gpu":100,"env_ram":101,"env_deps":102,"category_tags":108,"github_topics":78,"view_count":23,"oss_zip_url":78,"oss_zip_packed_at":78,"status":16,"created_at":109,"updated_at":110,"faqs":111,"releases":141},2277,"pytorch\u002Ftorchtitan","torchtitan","A PyTorch native platform for training generative AI models","torchtitan 是 PyTorch 官方推出的原生平台，专为生成式 AI 模型的大规模训练与快速实验而设计。它旨在解决开发者在构建大模型时面临的分布式训练复杂度高、代码修改繁琐以及基础设施难以灵活扩展等痛点。\n\n通过提供一套极简且干净的代码实现，torchtitan 让用户在应用多维并行策略（如 FSDP2 分片、张量并行及异步张量并行）时，几乎无需改动原有的模型代码。这种“开箱即用”的特性极大地降低了大规模训练的门槛，同时保留了高度的可定制性，支持用户通过扩展点轻松集成自定义组件。目前，它已成功应用于 Llama 3.1 等系列大模型的预训练验证中。\n\n这款工具非常适合 AI 研究人员、深度学习工程师以及对大模型底层训练架构感兴趣的开发者使用。如果你希望深入探索最新的分布式训练技术，或者需要一个灵活的基座来验证新的模型架构与基础设施方案，torchtitan 将是一个理想的选择。它不仅展示了 PyTorch 在分布式领域的最新成果，更致力于通过简洁高效的设计，加速生成式 AI 领域的创新步伐。","\u003Cdiv align=\"center\">\n\n# torchtitan\n\n#### A PyTorch native platform for training generative AI models\n\n[![8 GPU Feature 
Tests](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Factions\u002Fworkflows\u002Fintegration_test_8gpu_features.yaml\u002Fbadge.svg?branch=main)](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Factions\u002Fworkflows\u002Fintegration_test_8gpu_features.yaml?query=branch%3Amain)\n[![8 GPU Model Tests](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Factions\u002Fworkflows\u002Fintegration_test_8gpu_models.yaml\u002Fbadge.svg?branch=main)](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Factions\u002Fworkflows\u002Fintegration_test_8gpu_models.yaml?query=branch%3Amain)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2410.06511-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.06511)\n[![ICLR](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FICLR-2025-violet.svg)](https:\u002F\u002Ficlr.cc\u002Fvirtual\u002F2025\u002Fposter\u002F29620)\n[![forum](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpytorch-forum-DE3412.svg)](https:\u002F\u002Fdiscuss.pytorch.org\u002Fc\u002Fdistributed\u002Ftorchtitan\u002F44)\n[![license](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-BSD_3--Clause-lightgrey.svg)](.\u002FLICENSE)\n[![pip](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Ftorchtitan?color=blue)](https:\u002F\u002Fpypi.org\u002Fproject\u002Ftorchtitan\u002F)\n[![conda](https:\u002F\u002Fimg.shields.io\u002Fconda\u002Fvn\u002Fconda-forge\u002Ftorchtitan?color=green)](https:\u002F\u002Fanaconda.org\u002Fconda-forge\u002Ftorchtitan)\n\n\n\u003C\u002Fdiv>\n\n`torchtitan` is under extensive development. 
To use the latest features of `torchtitan`, we recommend using the most recent PyTorch nightly.\n\n\n## Latest News\n- [2025\u002F11] AMD released an [optimized fork](https:\u002F\u002Fgithub.com\u002FAMD-AGI\u002Ftorchtitan-amd\u002Ftree\u002Fmain) of `torchtitan` for AMD GPUs.\n- [2025\u002F10] We released `torchtitan` [v0.2.0](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Freleases).\n- [2025\u002F10] SkyPilot now supports `torchtitan`! See the tutorial [here](https:\u002F\u002Fdocs.skypilot.co\u002Fen\u002Flatest\u002Fexamples\u002Ftraining\u002Ftorchtitan.html).\n- [2025\u002F07] We published [instructions](\u002Ftorchtitan\u002Fmodels\u002FREADME.md) on how to add a model to `torchtitan`.\n- [2025\u002F04] Our paper was accepted by [ICLR 2025](https:\u002F\u002Ficlr.cc\u002Fvirtual\u002F2025\u002Fposter\u002F29620).\n- [2024\u002F12] GPU MODE [lecture](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=VYWRjcUqW6w) on torchtitan.\n- [2024\u002F07] [Presentation](https:\u002F\u002Fpytorch2024.sched.com\u002Fevent\u002F1fHn3) at PyTorch Conference 2024.\n\n\n## Overview\n\n`torchtitan` is a PyTorch native platform designed for **rapid experimentation and large-scale training** of generative AI models. As a minimal clean-room implementation of PyTorch native scaling techniques, `torchtitan` provides a flexible foundation for developers to build upon. 
With `torchtitan` [extension points](docs\u002Fextension.md), one can easily create custom extensions tailored to specific needs.\n\nOur mission is to accelerate innovation in the field of generative AI by empowering researchers and developers to explore new modeling architectures and infrastructure techniques.\n\nThe Guiding Principles when building `torchtitan`\n* Designed to be easy to understand, use and extend for different training purposes.\n* Minimal changes to the model code when applying multi-dimensional parallelism.\n* Bias towards a clean, minimal codebase while providing basic reusable \u002F swappable components.\n\n`torchtitan` has been showcasing PyTorch's latest distributed training features, via support for pretraining Llama 3.1 LLMs of various sizes.\n\n## Contributing\n\nWe look forward to your contributions!\n\n* To accelerate contributions to and innovations around torchtitan, we host an [`experiments`](torchtitan\u002Fexperiments) folder. New ideas should start there. To contribute, follow the [`experiments guidelines`](torchtitan\u002Fexperiments\u002FREADME.md).\n* For fixes and contributions to core, follow these [`guidelines`](CONTRIBUTING.md).\n\n## Llama 3.1 training\n\n### Key features available\n\n1. 
Multi-dimensional composable parallelisms\n   - [FSDP2](docs\u002Ffsdp.md) with per-parameter sharding\n   - [Tensor Parallel](https:\u002F\u002Fpytorch.org\u002Fdocs\u002Fstable\u002Fdistributed.tensor.parallel.html) (including [async TP](https:\u002F\u002Fdiscuss.pytorch.org\u002Ft\u002Fdistributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch\u002F209487))\n   - [Pipeline Parallel](https:\u002F\u002Fdiscuss.pytorch.org\u002Ft\u002Fdistributed-w-torchtitan-training-with-zero-bubble-pipeline-parallelism\u002F214420)\n   - [Context Parallel](https:\u002F\u002Fdiscuss.pytorch.org\u002Ft\u002Fdistributed-w-torchtitan-breaking-barriers-training-long-context-llms-with-1m-sequence-length-in-pytorch-using-context-parallel\u002F215082)\n2. [Meta device](https:\u002F\u002Fpytorch.org\u002Fdocs\u002Fstable\u002Fmeta.html) initialization\n3. Per-op selective and full activation checkpointing\n4. [Distributed checkpointing](https:\u002F\u002Fdiscuss.pytorch.org\u002Ft\u002Fdistributed-w-torchtitan-optimizing-checkpointing-efficiency-with-pytorch-dcp\u002F211250) (including async checkpointing)\n   - [Interoperable checkpoints](docs\u002Fcheckpoint.md) which can be loaded directly into [`torchtune`](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtune) for fine-tuning\n5. `torch.compile` support\n6. [Float8](https:\u002F\u002Fdiscuss.pytorch.org\u002Ft\u002Fdistributed-w-torchtitan-enabling-float8-all-gather-in-fsdp2\u002F209323) support ([how-to](docs\u002Ffloat8.md))\n7. [MXFP8 training for dense and MoE models](docs\u002Fmxfp8.md) on Blackwell GPUs\n8. DDP and HSDP\n9. [TorchFT](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchft) integration\n10. Checkpointable data-loading, with the C4 dataset pre-configured (144M entries) and support for [custom datasets](docs\u002Fdatasets.md)\n11. Gradient accumulation, enabled by giving an additional `--training.global_batch_size` argument on the CLI\n12. 
Flexible learning rate scheduler (warmup-stable-decay)\n13. Loss, GPU memory, throughput (tokens\u002Fsec), TFLOPs, and MFU displayed and logged via [Tensorboard or Weights & Biases](\u002Fdocs\u002Fmetrics.md)\n14. [Debugging tools](docs\u002Fdebugging.md) including CPU\u002FGPU profiling, memory profiling, Flight Recorder, etc.\n15. All options easily configured via [Python config registry](torchtitan\u002Fmodels\u002Fllama3\u002Fconfig_registry.py) with `--module` and `--config` CLI flags\n16. [Helper scripts](scripts\u002F) to\n    - download tokenizers from Hugging Face\n    - convert original Llama 3 checkpoints into the expected DCP format\n    - estimate FSDP\u002FHSDP memory usage without materializing the model\n    - run distributed inference with Tensor Parallel\n\nWe report [performance](benchmarks\u002Fllama3_h100_202412_torchtitan.md) on up to 512 GPUs, and verify [loss converging](docs\u002Fconverging.md) correctness of various techniques.\n\n### Dive into the code\n\nYou may want to see how the model is defined or how parallelism techniques are applied. 
For a guided tour, see these files first:\n* [torchtitan\u002Ftrain.py](torchtitan\u002Ftrain.py) - the main training loop and high-level setup code\n* [torchtitan\u002Fmodels\u002Fllama3\u002Fmodel.py](torchtitan\u002Fmodels\u002Fllama3\u002Fmodel.py) - the Llama 3.1 model definition\n* [torchtitan\u002Fmodels\u002Fllama3\u002Fparallelize.py](torchtitan\u002Fmodels\u002Fllama3\u002Fparallelize.py) - helpers for applying Data Parallel, Tensor Parallel, activation checkpointing, and `torch.compile` to the model\n* [torchtitan\u002Fdistributed\u002Fpipeline_parallel.py](torchtitan\u002Fdistributed\u002Fpipeline_parallel.py) - helpers for applying Pipeline Parallel to the model\n* [torchtitan\u002Fcomponents\u002Fcheckpoint.py](torchtitan\u002Fcomponents\u002Fcheckpoint.py) - utils for saving\u002Floading distributed checkpoints\n* [torchtitan\u002Fcomponents\u002Fquantization\u002Ffloat8.py](torchtitan\u002Fcomponents\u002Fquantization\u002Ffloat8.py) - utils for applying Float8 techniques\n\n\n## Installation\n\nOne can directly run the source code, or install `torchtitan` from a nightly build, or a stable release.\n\n### From source\n\nThis method requires the nightly build of PyTorch, or the latest PyTorch built [from source](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch?tab=readme-ov-file#from-source).\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\ncd torchtitan\npip install -r requirements.txt\npip install --pre torchdata --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fnightly\u002Fcpu\n```\n\n> **Note:** The nightly build of `torchdata` is required when using a PyTorch nightly. Install it from the nightly index as shown above.\n\n### Nightly builds\n\nThis method requires the nightly build of PyTorch. You can replace `cu128` with another version of cuda or an AMD GPU (e.g. 
`rocm6.3`).\n\n```sh\npip3 install --pre torch --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fnightly\u002Fcu128 --force-reinstall\npip install --pre torchtitan --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fnightly\u002Fcu128\n```\n\n### Stable releases\nOne can install the latest [stable release](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Freleases) of `torchtitan` via `pip` or `conda`.\n```sh\npip install torchtitan\n```\n```sh\nconda install conda-forge::torchtitan\n```\nNote that each stable release pins the nightly versions of `torch` and `torchao`. Please see [release.md](docs\u002Frelease.md) for more details.\n\n### Downloading a tokenizer\n\n`torchtitan` currently supports training Llama 3.1 (8B, 70B, 405B) out of the box. To get started training these models, we need to download the tokenizer. Follow the instructions on the official [meta-llama](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FLlama-3.1-8B) repository to ensure you have access to the Llama model weights.\n\nOnce you have confirmed access, you can run the following command to download the Llama 3.1 tokenizer to your local machine.\n\n```bash\n# Get your HF token from https:\u002F\u002Fhuggingface.co\u002Fsettings\u002Ftokens\n\n# Llama 3.1 tokenizer\npython scripts\u002Fdownload_hf_assets.py --repo_id meta-llama\u002FLlama-3.1-8B --assets tokenizer --hf_token=...\n```\n\n### Start a training run\nLlama 3 8B model locally on 8 GPUs\n\n```bash\nMODULE=llama3 CONFIG=llama3_8b .\u002Frun_train.sh\n```\n\n### Multi-Node Training\nFor training on ParallelCluster\u002FSlurm type configurations, you can use the `multinode_trainer.slurm` file to submit your sbatch job.\n\nTo get started adjust the number of nodes and GPUs\n```\n#SBATCH --ntasks=2\n#SBATCH --nodes=2\n```\n\nThen start a run where `nnodes` is your total node count, matching the sbatch node count above.\n\n```\nsrun torchrun --nnodes 2\n```\n\nIf your gpu count per node is 
not 8, adjust `--nproc_per_node` in the torchrun command and `#SBATCH --gpus-per-task` in the SBATCH command section.\n\n\n## Citation\n\nWe provide a detailed look into the parallelisms and optimizations available in `torchtitan`, along with summary advice on when to use various techniques.\n\n[TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training](https:\u002F\u002Fopenreview.net\u002Fforum?id=SFN6Wm7YBI)\n```\n@inproceedings{\n   liang2025torchtitan,\n   title={TorchTitan: One-stop PyTorch native solution for production ready {LLM} pretraining},\n   author={Wanchao Liang and Tianyu Liu and Less Wright and Will Constable and Andrew Gu and Chien-Chin Huang and Iris Zhang and Wei Feng and Howard Huang and Junjie Wang and Sanket Purandare and Gokul Nadathur and Stratos Idreos},\n   booktitle={The Thirteenth International Conference on Learning Representations},\n   year={2025},\n   url={https:\u002F\u002Fopenreview.net\u002Fforum?id=SFN6Wm7YBI}\n}\n```\n\n\n## License\n\nSource code is made available under a [BSD 3 license](.\u002FLICENSE), however you may have other legal obligations that govern your use of other content linked in this repository, such as the license or terms of service for third-party data and models.\n","\u003Cdiv align=\"center\">\n\n# torchtitan\n\n#### 一个原生基于 PyTorch 的生成式 AI 模型训练平台\n\n[![8 GPU 功能测试](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Factions\u002Fworkflows\u002Fintegration_test_8gpu_features.yaml\u002Fbadge.svg?branch=main)](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Factions\u002Fworkflows\u002Fintegration_test_8gpu_features.yaml?query=branch%3Amain)\n[![8 GPU 
模型测试](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Factions\u002Fworkflows\u002Fintegration_test_8gpu_models.yaml\u002Fbadge.svg?branch=main)](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Factions\u002Fworkflows\u002Fintegration_test_8gpu_models.yaml?query=branch%3Amain)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2410.06511-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.06511)\n[![ICLR](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FICLR-2025-violet.svg)](https:\u002F\u002Ficlr.cc\u002Fvirtual\u002F2025\u002Fposter\u002F29620)\n[![论坛](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpytorch-forum-DE3412.svg)](https:\u002F\u002Fdiscuss.pytorch.org\u002Fc\u002Fdistributed\u002Ftorchtitan\u002F44)\n[![许可证](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-BSD_3--Clause-lightgrey.svg)](.\u002FLICENSE)\n[![pip](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Ftorchtitan?color=blue)](https:\u002F\u002Fpypi.org\u002Fproject\u002Ftorchtitan\u002F)\n[![conda](https:\u002F\u002Fimg.shields.io\u002Fconda\u002Fvn\u002Fconda-forge\u002Ftorchtitan?color=green)](https:\u002F\u002Fanaconda.org\u002Fconda-forge\u002Ftorchtitan)\n\n\n\u003C\u002Fdiv>\n\n`torchtitan` 目前正处于大规模开发中。为了使用 `torchtitan` 的最新功能，我们建议您使用最新的 PyTorch 夜间版本。\n\n\n## 最新消息\n- [2025年11月] AMD 发布了针对 AMD 显卡的 `torchtitan` [优化分支](https:\u002F\u002Fgithub.com\u002FAMD-AGI\u002Ftorchtitan-amd\u002Ftree\u002Fmain)。\n- [2025年10月] 我们发布了 `torchtitan` [v0.2.0](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Freleases)。\n- [2025年10月] SkyPilot 现已支持 `torchtitan`！请参阅教程 [这里](https:\u002F\u002Fdocs.skypilot.co\u002Fen\u002Flatest\u002Fexamples\u002Ftraining\u002Ftorchtitan.html)。\n- [2025年7月] 我们发布了关于如何将模型添加到 `torchtitan` 的 [说明](\u002Ftorchtitan\u002Fmodels\u002FREADME.md)。\n- [2025年4月] 我们的论文已被 [ICLR 2025](https:\u002F\u002Ficlr.cc\u002Fvirtual\u002F2025\u002Fposter\u002F29620) 接受。\n- [2024年12月] GPU MODE 
[讲座](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=VYWRjcUqW6w) 讲解 torchtitan。\n- [2024年7月] 在 PyTorch 2024 大会上进行了 [演讲](https:\u002F\u002Fpytorch2024.sched.com\u002Fevent\u002F1fHn3)。\n\n\n## 概述\n\n`torchtitan` 是一个原生基于 PyTorch 的平台，专为生成式 AI 模型的 **快速实验和大规模训练** 而设计。作为 PyTorch 原生扩展技术的一个极简、干净的实现，`torchtitan` 为开发者提供了一个灵活的基础，便于在此之上进行构建。借助 `torchtitan` 的 [扩展点](docs\u002Fextension.md)，用户可以轻松创建满足特定需求的自定义扩展。\n\n我们的使命是通过赋能研究人员和开发者探索新的模型架构与基础设施技术，加速生成式 AI 领域的创新。\n\n构建 `torchtitan` 时的指导原则：\n* 设计简洁易懂，易于使用且可针对不同训练目的进行扩展。\n* 应用多维并行时对模型代码的改动最小化。\n* 偏向于保持代码库的整洁与精简，同时提供基础的可重用或可替换组件。\n\n`torchtitan` 已通过支持预训练各种规模的 Llama 3.1 LLM，展示了 PyTorch 最新的分布式训练特性。\n\n## 参与贡献\n\n我们期待您的贡献！\n\n* 为了加速对 `torchtitan` 的贡献及围绕它的创新，我们设立了 [`experiments`](torchtitan\u002Fexperiments) 文件夹。新想法应从那里开始。如需贡献，请遵循 [`experiments 指南`](torchtitan\u002Fexperiments\u002FREADME.md)。\n* 对核心部分的修复与贡献，请遵循这些 [`指南`](CONTRIBUTING.md)。\n\n## Llama 3.1 训练\n\n### 可用的关键特性\n\n1. 多维可组合并行\n   - 带有参数级分片的 [FSDP2](docs\u002Ffsdp.md)\n   - [张量并行](https:\u002F\u002Fpytorch.org\u002Fdocs\u002Fstable\u002Fdistributed.tensor.parallel.html)（包括 [异步 TP](https:\u002F\u002Fdiscuss.pytorch.org\u002Ft\u002Fdistributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch\u002F209487)）\n   - [流水线并行](https:\u002F\u002Fdiscuss.pytorch.org\u002Ft\u002Fdistributed-w-torchtitan-training-with-zero-bubble-pipeline-parallelism\u002F214420)\n   - [上下文并行](https:\u002F\u002Fdiscuss.pytorch.org\u002Ft\u002Fdistributed-w-torchtitan-breaking-barriers-training-long-context-llms-with-1m-sequence-length-in-pytorch-using-context-parallel\u002F215082)\n2. [Meta device](https:\u002F\u002Fpytorch.org\u002Fdocs\u002Fstable\u002Fmeta.html) 初始化\n3. 按操作选择性及全激活检查点保存\n4. [分布式检查点](https:\u002F\u002Fdiscuss.pytorch.org\u002Ft\u002Fdistributed-w-torchtitan-optimizing-checkpointing-efficiency-with-pytorch-dcp\u002F211250)（包括异步检查点）\n   - [互操作检查点](docs\u002Fcheckpoint.md)，可以直接加载到 [`torchtune`](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtune) 中用于微调\n5. 
`torch.compile` 支持\n6. [Float8](https:\u002F\u002Fdiscuss.pytorch.org\u002Ft\u002Fdistributed-w-torchtitan-enabling-float8-all-gather-in-fsdp2\u002F209323) 支持（[使用方法](docs\u002Ffloat8.md)）\n7. Blackwell GPU 上的 [MXFP8 训练，适用于密集型和 MoE 模型](docs\u002Fmxfp8.md)\n8. DDP 和 HSDP\n9. [TorchFT](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchft) 集成\n10. 可检查点的数据加载，预配置了 C4 数据集（1.44亿条记录），并支持 [自定义数据集](docs\u002Fdatasets.md)\n11. 梯度累积，可通过在命令行中添加额外的 `--training.global_batch_size` 参数来启用\n12. 灵活的学习率调度器（预热-稳定-衰减）\n13. 损失、GPU 内存、吞吐量（tokens\u002Fsec）、TFLOPs 和 MFU 会通过 [TensorBoard 或 Weights & Biases](\u002Fdocs\u002Fmetrics.md) 进行显示和记录\n14. [调试工具](docs\u002Fdebugging.md)，包括 CPU\u002FGPU 性能分析、内存分析、Flight Recorder 等\n15. 所有选项均可通过 [Python 配置注册表](torchtitan\u002Fmodels\u002Fllama3\u002Fconfig_registry.py) 轻松配置，只需使用 `--module` 和 `--config` 命令行参数\n16. [辅助脚本](scripts\u002F) 用于：\n    - 从 Hugging Face 下载分词器\n    - 将原始 Llama 3 检查点转换为预期的 DCP 格式\n    - 在不实例化模型的情况下估算 FSDP\u002FHSDP 的内存占用\n    - 使用张量并行运行分布式推理\n\n我们报告了在最多 512 张 GPU 上的 [性能](benchmarks\u002Fllama3_h100_202412_torchtitan.md)，并验证了各种技术的 [损失收敛](docs\u002Fconverging.md) 正确性。\n\n### 深入代码\n\n你可能想了解模型是如何定义的，或者并行化技术是如何应用的。为了更好地理解，可以先查看以下文件：\n* [torchtitan\u002Ftrain.py](torchtitan\u002Ftrain.py) - 主训练循环及高层次的设置代码\n* [torchtitan\u002Fmodels\u002Fllama3\u002Fmodel.py](torchtitan\u002Fmodels\u002Fllama3\u002Fmodel.py) - Llama 3.1 模型的定义\n* [torchtitan\u002Fmodels\u002Fllama3\u002Fparallelize.py](torchtitan\u002Fmodels\u002Fllama3\u002Fparallelize.py) - 用于将数据并行、张量并行、激活检查点以及 `torch.compile` 应用到模型上的辅助工具\n* [torchtitan\u002Fdistributed\u002Fpipeline_parallel.py](torchtitan\u002Fdistributed\u002Fpipeline_parallel.py) - 用于将流水线并行应用到模型上的辅助工具\n* [torchtitan\u002Fcomponents\u002Fcheckpoint.py](torchtitan\u002Fcomponents\u002Fcheckpoint.py) - 用于保存和加载分布式检查点的工具\n* [torchtitan\u002Fcomponents\u002Fquantization\u002Ffloat8.py](torchtitan\u002Fcomponents\u002Fquantization\u002Ffloat8.py) - 用于应用 Float8 技术的工具\n\n\n## 安装\n\n你可以直接运行源代码，也可以从 nightly 构建版本或稳定版中安装 
`torchtitan`。\n\n### 从源码安装\n\n此方法需要 PyTorch 的 nightly 构建版本，或者从 [源码](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch?tab=readme-ov-file#from-source) 构建的最新 PyTorch 版本。\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\ncd torchtitan\npip install -r requirements.txt\npip install --pre torchdata --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fnightly\u002Fcpu\n```\n\n> **注意：** 使用 PyTorch 的 nightly 版本时，需要 `torchdata` 的 nightly 构建版本。请按照上述方式从 nightly 索引中安装。\n\n### Nightly 构建版本\n\n此方法需要 PyTorch 的 nightly 构建版本。你可以将 `cu128` 替换为其他版本的 CUDA 或 AMD GPU（例如 `rocm6.3`）。\n\n```sh\npip3 install --pre torch --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fnightly\u002Fcu128 --force-reinstall\npip install --pre torchtitan --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fnightly\u002Fcu128\n```\n\n### 安装稳定版\n\n你可以通过 `pip` 或 `conda` 安装 `torchtitan` 的最新 [稳定版](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Freleases)。\n```sh\npip install torchtitan\n```\n```sh\nconda install conda-forge::torchtitan\n```\n请注意，每个稳定版都会锁定 `torch` 和 `torchao` 的 nightly 版本。更多详情请参阅 [release.md](docs\u002Frelease.md) 文件。\n\n### 下载分词器\n\n`torchtitan` 目前开箱即用地支持训练 Llama 3.1（8B、70B、405B）。要开始训练这些模型，我们需要下载分词器。请按照官方 [meta-llama](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FLlama-3.1-8B) 仓库中的说明操作，以确保你有权访问 Llama 模型权重。\n\n确认权限后，你可以运行以下命令将 Llama 3.1 分词器下载到本地机器上。\n\n```bash\n# 从 https:\u002F\u002Fhuggingface.co\u002Fsettings\u002Ftokens 获取你的 HF 令牌\n\n# Llama 3.1 分词器\npython scripts\u002Fdownload_hf_assets.py --repo_id meta-llama\u002FLlama-3.1-8B --assets tokenizer --hf_token=...\n```\n\n### 开始训练\n\n在 8 张 GPU 上本地训练 Llama 3 8B 模型\n\n```bash\nMODULE=llama3 CONFIG=llama3_8b .\u002Frun_train.sh\n```\n\n### 多节点训练\n\n对于 ParallelCluster\u002FSlurm 类型的配置，你可以使用 `multinode_trainer.slurm` 文件提交 sbatch 作业。\n\n首先调整节点和 GPU 的数量：\n```\n#SBATCH --ntasks=2\n#SBATCH --nodes=2\n```\n\n然后启动一个运行，其中 `nnodes` 是你的总节点数，与上面的 sbatch 
节点数一致。\n```\nsrun torchrun --nnodes 2\n```\n\n如果你每节点的 GPU 数量不是 8 张，请相应调整 `torchrun` 命令中的 `--nproc_per_node` 以及 `SBATCH` 命令部分的 `#SBATCH --gpus-per-task`。\n\n\n## 引用\n\n我们详细介绍了 `torchtitan` 中可用的并行化和优化技术，并总结了何时使用各种技术的建议。\n\n[TorchTitan：面向生产级 LLM 预训练的一站式 PyTorch 原生解决方案](https:\u002F\u002Fopenreview.net\u002Fforum?id=SFN6Wm7YBI)\n```\n@inproceedings{\n   liang2025torchtitan,\n   title={TorchTitan: One-stop PyTorch native solution for production ready {LLM} pretraining},\n   author={Wanchao Liang and Tianyu Liu and Less Wright and Will Constable and Andrew Gu and Chien-Chin Huang and Iris Zhang and Wei Feng and Howard Huang and Junjie Wang and Sanket Purandare and Gokul Nadathur and Stratos Idreos},\n   booktitle={The Thirteenth International Conference on Learning Representations},\n   year={2025},\n   url={https:\u002F\u002Fopenreview.net\u002Fforum?id=SFN6Wm7YBI}\n}\n```\n\n\n## 许可证\n\n源代码根据 [BSD 3 许可证](.\u002FLICENSE) 提供，但你可能还需遵守其他法律义务，这些义务适用于本仓库中链接的其他内容，例如第三方数据和模型的许可证或服务条款。","# TorchTitan 快速上手指南\n\nTorchTitan 是一个原生的 PyTorch 平台，专为生成式 AI 模型的**快速实验和大规模训练**而设计。它提供了最小化且清晰的代码实现，支持多维并行策略（如 FSDP2、张量并行、流水线并行等），是预训练 Llama 3.1 等大模型的理想选择。\n\n## 环境准备\n\n### 系统要求\n- **操作系统**: Linux (推荐 Ubuntu 20.04+)\n- **GPU**: NVIDIA GPU (支持 CUDA) 或 AMD GPU (需使用特定 fork 版本)\n- **Python**: 3.8 或更高版本\n\n### 前置依赖\n- **PyTorch**: \n  - 若要使用最新特性，强烈建议安装 **PyTorch Nightly** 版本。\n  - 若使用稳定版，请注意部分新特性可能不可用。\n- **其他依赖**: `torchdata` (Nightly 版本需单独安装)。\n\n> **注意**: 国内用户若访问 PyTorch 官方源较慢，可尝试配置清华或阿里镜像源加速 pip 下载，但 Nightly 版本通常仅官方源提供最新构建。\n\n## 安装步骤\n\n你可以选择从源码安装、安装 Nightly 版本或稳定版本。\n\n### 方式一：从源码安装（推荐用于开发和新特性体验）\n\n此方法需要预先安装 PyTorch Nightly。\n\n```bash\n# 1. 克隆仓库\ngit clone https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\ncd torchtitan\n\n# 2. 安装基础依赖\npip install -r requirements.txt\n\n# 3. 
安装 PyTorch Nightly 和 torchdata (以 CUDA 12.8 为例，可根据实际环境调整 cu12x)\npip install --pre torch --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fnightly\u002Fcu128 --force-reinstall\npip install --pre torchdata --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fnightly\u002Fcpu\n```\n\n### 方式二：安装稳定发布版\n\n适合生产环境或不需要最新实验特性的场景。\n\n**使用 pip:**\n```bash\npip install torchtitan\n```\n\n**使用 conda:**\n```bash\nconda install conda-forge::torchtitan\n```\n\n> **提示**: 稳定版本会锁定特定的 `torch` 和 `torchao` Nightly 版本，详情请查阅项目文档中的 release.md。\n\n### 获取模型分词器 (Tokenizer)\n\nTorchTitan 原生支持 Llama 3.1 (8B, 70B, 405B) 训练。开始前需下载分词器。\n请确保你已在 Hugging Face 获得 `meta-llama` 的访问权限并获取 Token。\n\n```bash\n# 将 ... 替换为你的 Hugging Face Token\npython scripts\u002Fdownload_hf_assets.py --repo_id meta-llama\u002FLlama-3.1-8B --assets tokenizer --hf_token=...\n```\n\n## 基本使用\n\n### 单机多卡训练示例\n\n以下命令演示如何在本地 8 张 GPU 上启动 Llama 3 8B 模型的训练：\n\n```bash\nMODULE=llama3 CONFIG=llama3_8b .\u002Frun_train.sh\n```\n\n### 多节点训练示例 (Slurm 环境)\n\n若在集群环境（如 Slurm）运行，需修改提交脚本 `multinode_trainer.slurm` 中的节点数和 GPU 数，然后提交任务：\n\n1. 编辑脚本设置节点数：\n   ```bash\n   #SBATCH --ntasks=2\n   #SBATCH --nodes=2\n   ```\n2. 
Launch training (make sure `nnodes` matches the script settings):\n   ```bash\n   srun torchrun --nnodes 2 ...\n   ```\n\n### Key Feature Configuration\n\nAll training options can be configured flexibly via CLI arguments:\n- **Mixed precision\u002Fquantization**: Supports Float8 and MXFP8 (requires Blackwell GPUs).\n- **Parallelism strategies**: Automatically composes FSDP2, tensor parallelism (TP), pipeline parallelism (PP), and context parallelism (CP).\n- **Monitoring**: Supports TensorBoard and Weights & Biases for logging loss, GPU memory, throughput, and other metrics.\n\nFor detailed advanced configuration and custom extensions, see the `docs\u002F` folder in the project directory and `torchtitan\u002Fmodels\u002Fllama3\u002Fconfig_registry.py`.","The research team at an AI startup lab is trying to pretrain a customized Llama 3.1 variant from scratch on a limited 8-GPU cluster to validate new architecture hypotheses.\n\n### Without torchtitan\n- **Complex parallelism implementation**: Manually composing FSDP2 with tensor parallelism (Tensor Parallel) requires large amounts of boilerplate code, and one misstep can cause out-of-memory errors or communication deadlocks.\n- **High cost of model changes**: Adapting to distributed training forces heavy refactoring of the original model definition code, stretching experiment iteration cycles from days to weeks.\n- **Steep debugging barrier**: With no native observability support, diagnosing gradient synchronization issues across multiple GPUs is like probing a black box, severely slowing development progress.\n- **Limited extensibility**: Existing scripts make it hard to switch on advanced features such as async tensor parallelism, so the hardware cannot be fully utilized for large-scale exploration.\n\n### With torchtitan\n- **Out-of-the-box parallelism composition**: The team calls torchtitan's built-in multi-dimensional parallelism components directly, enabling FSDP2 sharding and async tensor parallelism with a single switch and no changes to the core model code.\n- **Minimal code intrusion**: The model structure stays clean and clear; complex distributed strategies are applied through configuration alone, letting researchers focus on algorithmic innovation rather than engineering chores.\n- **Native observability and debugging**: Built on the native PyTorch ecosystem, training state and resource consumption are easy to monitor, so distributed-training bottlenecks are quickly located and resolved.\n- **Flexible feature extension**: Using its clean extension-point mechanism, the team rapidly integrated a custom optimizer and completed large-scale training validation from single-node to multi-node.\n\nBy providing a minimal, PyTorch-native foundation for distributed training, torchtitan reduces the engineering complexity of large-model pretraining to a minimum, letting teams validate new generative-AI architecture ideas faster than ever.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fpytorch_torchtitan_997d6f4e.png","pytorch","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fpytorch_be722ba8.jpg","",null,"https:\u002F\u002Fpytorch.org","https:\u002F\u002Fgithub.com\u002Fpytorch",[82,86,90],{"name":83,"color":84,"percentage":85},"Python","#3572A5",99.2,{"name":87,"color":88,"percentage":89},"Shell","#89e051",0.8,{"name":91,"color":92,"percentage":93},"Dockerfile","#384d54",0.1,5207,771,"2026-04-03T06:24:05","BSD-3-Clause",4,"Linux","Required. Supports NVIDIA GPUs (CUDA 12.8\u002Fnightly) or AMD GPUs (ROCm 6.3, via a dedicated optimized branch). The documentation mentions running on 8 to 512 GPUs, with MXFP8 training supported on the Blackwell architecture. Actual memory requirements depend on model size (e.g., Llama 3.1 8B\u002F70B\u002F405B) and parallelism strategy.","Not specified",{"notes":103,"python":101,"dependencies":104},"1. Using the latest PyTorch nightly build is strongly recommended for access to the newest features.\n2. If using PyTorch nightly, the matching torchdata must be installed from the nightly index.\n3. Distributed training via Slurm or multi-node clusters is supported.\n4. Before training Llama 3.1 models, the tokenizer and weights must be downloaded manually.\n5. AMD users need the dedicated optimized branch (torchtitan-amd).",[105,106,107],"torch (nightly build recommended)","torchdata (nightly build)","torchao (pinned version in stable releases)",[26,13],"2026-03-27T02:49:30.150509","2026-04-06T09:44:30.016876",[112,117,122,127,132,137],{"id":113,"question_zh":114,"answer_zh":115,"source_url":116},10461,"What should I do about a leading-dims mismatch error when using Float8 rowwise scaling together with async tensor parallelism (AsyncTP) and full activation checkpointing (full AC)?","The issue is caused by the scatter_dim index not being updated after the tensor is reshaped. On the original 3-D tensor, scatter_dim=1 refers to the sequence dimension, but after torchao reshapes it to 2-D the same index is wrongly applied to the hidden dimension. The fix is to apply the PyTorch Core patch: https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fpull\u002F148001. The patch resolves the crash without affecting numerical accuracy.","https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fissues\u002F864",{"id":118,"question_zh":119,"answer_zh":120,"source_url":121},10462,"Why does async tensor parallelism (AsyncTP) deliver little speedup over plain TP when using rowwise Float8?","Some fused operations do not yet fully support rowwise scaling. The `matmul-reduce-scatter` fusion is implemented (see https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fpull\u002F149247) and yields roughly a 3.7% speedup, but reaching the expected 8%+ gain also requires fusing `all-gather-matmul` with rowwise scaling, which is work in progress (see https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fissues\u002F149990). In addition, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fpull\u002F149652 fixes an unnecessary reduce_scatter activation save, which also helps performance.","https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fissues\u002F866",{"id":123,"question_zh":124,"answer_zh":125,"source_url":126},10463,"How do I fix NaN loss when training Llama models with rowwise FP8 quantization?","This is a known issue with a temporary workaround already landed in TorchTitan. Merge or apply the workaround in PR #1108 (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fpull\u002F1108). A long-term fix for the underlying problem is being tracked in PyTorch Core; see https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fissues\u002F150859.","https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fissues\u002F1056",{"id":128,"question_zh":129,"answer_zh":130,"source_url":131},10464,"The newly added fused RMSNorm operator breaks tensor parallel (TP) training, with the error pointing at the backward pass. How do I handle this?","DTensor needs support for the `aten._fused_rms_norm.default` and `aten._fused_rms_norm_backward.default` operators. Because these fused operators introduce new dispatch logic, the current decomposition (decomp) path may not handle the distributed case correctly. For now, avoid the fused operator, or wait for a TorchTitan update that explicitly supports distributed implementations of these new operators.","https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fissues\u002F1421",{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},10465,"How do I debug the checkpoint recomputation metadata mismatch error (Recomputed values have different metadata), especially in multi-node training?","The error can appear under certain configurations (e.g., 4-node training) while single-node runs work fine. Because it is hard to reproduce, the maintainers ask for concrete reproduction steps (a reproducer) to debug it. A throughput drop while using SAC (Selective Activation Checkpointing) may be a separate issue. Check that tensor shapes and device metadata are consistent across nodes, and try to build a minimal reproduction in TorchTitan to submit to the maintainers.","https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fissues\u002F1117",{"id":138,"question_zh":139,"answer_zh":140,"source_url":121},10466,"In Float8 rowwise mode, why do unit tests pass but actually running torchtitan fails with a meta registration error?","This usually happens because the unit tests only verify that the fused node exists in the graph structure, while at runtime the meta registration logic for the input tensor dimensions or scale factors diverges from the actual execution path. Specifically, `meta_scaled_mm` requires scale_a and scale_b to be 2-D, and some reshapes of 3D+ inputs may fail to propagate dimension information correctly. Make sure your PyTorch build includes the robustness support for rowwise scaling on 3D+ input shapes (see the update in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fpull\u002F149247).",[142,147,152,157],{"id":143,"version":144,"summary_zh":145,"released_at":146},71027,"v0.2.2","## Dependency\r\nPyTorch Version: `torch-2.12.0.dev20260220+cu126`\r\nTorchAO Version: `torchao-0.17.0.dev20260220+cu126`\r\n\r\n## What's Changed\r\n## 🚀 Features\r\n\r\n- [CP] Refactor Context Parallel to use new PyTorch CP APIs (#2144) by @fegin\r\n- [CP] Enable FlexCP for llama3 (#2145) by @fegin\r\n- [Compiler Toolkit] Add option for full inductor (#2150) by @aditvenk\r\n- Add docs to explain `COMM_MODE` (#2162) by @fegin\r\n- Disable dynamo LRU cache when AC is enabled (#2204) by @soulitzer\r\n- feat(gpt-oss): add YaRN RoPE extensions with mscale for extended context (#2216) by @eous\r\n- [ROCm] Support mxfp8 on gfx950 (#2222) by @RuibinCheung\r\n- Enable memory snapshot for generic devices (#2228) by @frost-intel\r\n- GQA without kv repeats (#2259) by 
@francesco-bertolotti\r\n- [rl] GQA attention enablement in torchtitan vllm wrapper (#2299) by @wwwjn\r\n- Add peak flops for NVIDIA H20 GPUs (#2307) by @DamonFool\r\n- Add missing `job_config.maybe_log()` calls (#2308) by @EquationWalker\r\n- [DeepEP] Implement shared_experts overlap with deepep.combine() (#2310) by @vivekgoe\r\n- Separate out training for fault tolerance (#2311) by @tushar00jain\r\n- Torchtitan changes to integrate into Verl (#2333) by @acisseJZhong\r\n- Maintain same LR schedule for early stop debug runs (#2340) by @acisseJZhong\r\n- [Compiler Toolkit] Separate process groups for FSDP AG\u002FRS comm overlap (#2368) by @yiming0416\r\n- [AC] Set `preserve_rng_state=True` as default for activation checkpointing (#2380) by @soulitzer\r\n\r\n## 🧠 Model\r\n\r\n- Add attention scaling to varlen for Qwen3 (#2178) by @liangel-02\r\n- Make get TP mesh optional in Llama4 parallelize (#2185) by @danielvegamyhre\r\n- [GPT-OSS] Graduate from experiments to main (#2203) by @shuhuayu\r\n- [autoparallel] Update `local_map_deepseek_v3` device mesh usage (#2231) by @xmfan\r\n- [varlen_attn] Change `is_causal` to `window_size` (#2267) by @liangel-02\r\n- Remove `_ScaledPartial` placement (#2337) by @Aidyn-A\r\n- Added custom `trunc_normal` (#2342) by @francesco-bertolotti\r\n- Removed weight initialization from model `__init__` (#2361) by @francesco-bertolotti\r\n\r\n## 🐛 Bug Fixes\r\n\r\n- [docs] Fix missing `--model.flavor` flags in compiler_toolkit README (#2201) by @BryanBradfo\r\n- Fix loss computation by handling valid token imbalance in train loop (#2206) by @wwwjn\r\n- [MoE] Fix experts DTensor metadata bug for DCP (#2227) by @shuhuayu\r\n- Fix sdpa-varlen attention mismatch in Qwen3 (#2229) by @francesco-bertolotti\r\n- Weight tying fix for Qwen3 (#2253) by @francesco-bertolotti\r\n- Fix grad norm clipping for AutoP and DSv3 model init (#2270) by @sanketpurandare\r\n- [MoE] DeepEP refactor and fix memory leak during training and inference (#2296) by 
@shuhuayu\r\n- [docs] Fix type mismatch in model layers comments (#2306) by @DamonFool\r\n- Fix FLUX attention by exposing `is_causal` in SDPA (#2309) by @wwwjn\r\n- Fixing `global_max_loss` computation (#2314) by @Shagun-G\r\n- Fix the CI loss issue (#2315) by @fegin\r\n- Fix gpt-oss implementation (MoE router gate bias + top-k renorm) (#2319) by @linyuhongg\r\n- [SimpleFSDP] Fix HSDP placement mismatch in `_distribute_dtensor` (#2329) by @SongyuanZhao\r\n- [Bugfix] Fix bitwise determinism after vLLM `SiluAndMul` change (#2358) by @Lucaskabela\r\n- [Bugfix] Fix `simple_rl_multiprocess.py` to be runnable with recent vLLM version (#2359) by @Lucaskabela\r\n- Fixing extra averaging performed in validation error (#2366) by @Shagun-G\r\n- Bug fix: don't swallow `OutOfMemoryError` when `enable_memory_snapshot=True` (#2374) by @weifengpy\r\n- Fix: restrict completion logging to rank 0 (#2383) by @fatih-uzlmz\r\n\r\n## 🧪 Experiments \u002F CI \u002F Infra\r\n\r\n- [Experimental][rl][vllm compat] Update simple_rl example to work with vLLM nightly (#2219) by @Lucaskabela\r\n- [Experimental][rl][unified] Update `infer.py` example to work with vLLM nightly (#2226) by @Lucaskabela\r\n- Add ROCm support for H100 tests (#2202) by @akashveramd\r\n- Add ROCm CI support for simple FSDP experiments test (#2220) by @akashveramd\r\n- Add test for DSv3 with flexattn + fsdp + ep + pp + sac op (#2234) by @shuhuayu\r\n- Add ROCm CI support for Auto Parallel & Compiler Toolkit experiments (#2248) by @akashveramd\r\n- Add ROCm CI support for Transformers Modeling Backend & VLM experiments (#2276) by @akashveramd\r\n- Update CPU unit test to use `linux_job_v2` (#2287) by @joecummings\r\n- [BE week] Disable CPU wheel builds in nightly CI (#2289) by @joecummings\r\n- [BE][NFC] Add integration test for simplefsdp + CP deepseek_v3 (#2301) by @aditvenk\r\n- Fixed autoparallel integration tests on ROCm (#2321) by @wenchenvincent\r\n- [rl][ez] Squash landing import and git fixes (#2331) by 
@zhxchen17\r\n- [ci] Add DSv3 SimpleFSDP `auto_bucketing` to H100 CI jobs (#2347) by @IvanKobzarev\r\n- Bump `tj-actions\u002Fchanged-files` from 47.0.1 to 47.0.2 (#2367) by @dependabot\r\n- [CI] Disable NVLS (#2372) by @fegin\r\n- Bump `tj-actions\u002Fchanged-files` from 47.0.2 to 47.0.4 (#2390) by @dependabot\r\n- [rl] Install vllm from pre-built wheels (#2397) by @wwwjn\r\n- [rl\u002Funified] Update default `model-ckpt-pa","2026-02-20T22:47:00",{"id":148,"version":149,"summary_zh":150,"released_at":151},71028,"v0.2.1","\r\n## Dependency\r\npytorch version: `torch-2.11.0.dev20251226+cu126`\r\ntorchao version: `torchao-0.16.0.dev20251226+cu126`\r\n\r\n## What's Changed\r\n### Features\r\n* Use new DeviceMesh unflatten to rewrite parallel_dims by @fegin in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fpull\u002F1660\r\n* Re:Run Torchtitan ROCm workflow on cron schedule & push to Main branch only by @akashveramd in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fpull\u002F2018\r\n* adding variable length attention to llama3 8b by @liangel-02 in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fpull\u002F2000\r\n* [Local Tensor] Replace dry_run.py with fake mode implementation by @fegin in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fpull\u002F2057\r\n\r\n### Model\r\n* Enable PP and EP overlap for MoE by @H-Huang in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fpull\u002F1721\r\n* Integrate DeepEP to torchtitan by @elfiegg in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fpull\u002F2107\r\n* [MoE] Add node limited routing support by @shuhuayu in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fpull\u002F2111\r\n* Add Context Parallelism to Flux model training by @limou102 in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fpull\u002F1851\r\n* gpt-oss model enablement by @wwwjn in 
https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fpull\u002F1754\r\n* [GPT-OSS] Add HF state dict adapter to support loading from HF checkpoints by @shuhuayu in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fpull\u002F2021\r\n\r\n### Bug Fix\r\n* [FLOPs] Fix attention FLOPs estimate by @shuhuayu in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fpull\u002F1923\r\n* Fix apply_compile called multiple times in PP initialization by @xmfan in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fpull\u002F2135\r\n* Fix qwen3 attention scaling calculation by @wwwjn in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fpull\u002F2173\r\n\r\n### Experiments\r\n* **Exploring toolkit-style use of the compiler stack** @SherlockNoMad @yiming0416 :  [Compiler Toolkit] JointGraph-based Training Prototype for llama3 by @SherlockNoMad in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fpull\u002F1794\r\n* **Bit-wise identity RL between torchtitan Trainer and vLLM sampler**: Add deterministic RL training experiment with vLLM by @bwasti in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fpull\u002F1975\r\n* **Train models from`transformers` with torchtitan**:  3outeille\u002Ftransformers backend (Dense model only) by @3outeille in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fpull\u002F2048\r\n* **Auto Parallel Examples** @wconstab @xmfan: Autoparallel as an experiment in main by @xmfan in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fpull\u002F2054\r\n* **Unified Model definition in RL loop** @wwwjn  @acisseJZhong  @zhxchen17 : Run vLLM inference using torchtitan model definition (single GPU) by @wwwjn in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fpull\u002F2119\r\n\r\n\r\n**Full Changelog**: 
https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fcompare\u002Fv0.2.0...v0.2.1","2025-12-26T23:29:24",{"id":153,"version":154,"summary_zh":155,"released_at":156},71029,"v0.2.0","# Dependency\r\n\r\npytorch version: `torch-2.10.0.dev20251019+cu126`\r\ntorchao version: `torchao-0.15.0.dev20251015+cu126`\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fcompare\u002Fv0.1.0...v0.2.0","2025-10-18T04:26:23",{"id":158,"version":159,"summary_zh":160,"released_at":161},71030,"v0.1.0","This is the first pre-release of torchtitan, following the release practice outlined in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\u002Fissues\u002F688.\r\n\r\ntorch version: `torch-2.8.0.dev20250617+cu126`\r\ntorchao version: `torchao-0.12.0.dev20250617+cu126`","2025-06-18T00:39:29"]