[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tool-huggingface--accelerate":3,"similar-huggingface--accelerate":186},{"id":4,"github_repo":5,"name":6,"description_en":7,"description_zh":8,"ai_summary_zh":9,"readme_en":10,"readme_zh":11,"quickstart_zh":12,"use_case_zh":13,"hero_image_url":14,"owner_login":15,"owner_name":16,"owner_avatar_url":17,"owner_bio":18,"owner_company":19,"owner_location":19,"owner_email":19,"owner_twitter":15,"owner_website":20,"owner_url":21,"languages":22,"stars":38,"forks":39,"last_commit_at":40,"license":41,"difficulty_score":42,"env_os":43,"env_gpu":44,"env_ram":43,"env_deps":45,"category_tags":50,"github_topics":19,"view_count":42,"oss_zip_url":19,"oss_zip_packed_at":19,"status":52,"created_at":53,"updated_at":54,"faqs":55,"releases":85},8860,"huggingface\u002Faccelerate","accelerate","🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support","Accelerate 是由 Hugging Face 推出的开源库，旨在帮助开发者轻松地在各种硬件配置上运行 PyTorch 模型。无论是单台 CPU、单张 GPU、多卡分布式训练，还是使用 TPU，Accelerate 都能让同一份代码无缝适配，无需针对不同环境反复修改。\n\n它主要解决了深度学习训练中繁琐的“样板代码”问题。传统模式下，开发者需要手动编写大量逻辑来处理设备放置、混合精度（如 fp16、bf16 乃至最新的 fp8）以及复杂的分布式策略（如 FSDP 和 DeepSpeed）。而 Accelerate 通过极简的 API 抽象了这些底层细节，用户只需在原有训练脚本中增加几行代码，即可自动完成模型、优化器和数据加载器的准备与加速，大幅降低了多设备训练的门槛。\n\n这款工具特别适合习惯手写 PyTorch 训练循环的研究人员和工程师。如果你希望专注于模型算法本身，而不愿被底层基础设施的兼容性所困扰，Accelerate 是理想的选择。其核心亮点在于“一次编写，到处运行”：同样的代码既可在本地笔记本上调试，也能直接部署到大型集群进行大规模训练，真正实现了开发效率与运行性能的平衡","Accelerate 是由 Hugging Face 推出的开源库，旨在帮助开发者轻松地在各种硬件配置上运行 PyTorch 模型。无论是单台 CPU、单张 GPU、多卡分布式训练，还是使用 TPU，Accelerate 都能让同一份代码无缝适配，无需针对不同环境反复修改。\n\n它主要解决了深度学习训练中繁琐的“样板代码”问题。传统模式下，开发者需要手动编写大量逻辑来处理设备放置、混合精度（如 fp16、bf16 乃至最新的 fp8）以及复杂的分布式策略（如 FSDP 和 DeepSpeed）。而 Accelerate 通过极简的 API 抽象了这些底层细节，用户只需在原有训练脚本中增加几行代码，即可自动完成模型、优化器和数据加载器的准备与加速，大幅降低了多设备训练的门槛。\n\n这款工具特别适合习惯手写 PyTorch 训练循环的研究人员和工程师。如果你希望专注于模型算法本身，而不愿被底层基础设施的兼容性所困扰，Accelerate 是理想的选择。其核心亮点在于“一次编写，到处运行”：同样的代码既可在本地笔记本上调试，也能直接部署到大型集群进行大规模训练，真正实现了开发效率与运行性能的平衡。","\u003C!---\nCopyright 2021 The HuggingFace Team. 
All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

<p align="center">
    <br>
    <img src="https://oss.gittoolsai.com/images/huggingface_accelerate_readme_852e72c7a923.png" width="400"/>
    <br>
</p>

<p align="center">
    <!-- Uncomment when CircleCI is set up
    <a href="https://circleci.com/gh/huggingface/accelerate"><img alt="Build" src="https://img.shields.io/circleci/build/github/huggingface/transformers/master"></a>
    -->
    <a href="https://github.com/huggingface/accelerate/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/github/license/huggingface/accelerate.svg?color=blue"></a>
    <a href="https://huggingface.co/docs/accelerate/index.html"><img alt="Documentation" src="https://img.shields.io/website/http/huggingface.co/docs/accelerate/index.html.svg?down_color=red&down_message=offline&up_message=online"></a>
    <a href="https://github.com/huggingface/accelerate/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/accelerate.svg"></a>
    <a href="https://github.com/huggingface/accelerate/blob/main/CODE_OF_CONDUCT.md"><img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-v2.0%20adopted-ff69b4.svg"></a>
</p>

<h3 align="center">
<p>Run your *raw* PyTorch training script on any kind of device</p>
</h3>

<h3 align="center">
    <a href="https://hf.co/course"><img src="https://oss.gittoolsai.com/images/huggingface_accelerate_readme_d40d18c65161.png"></a>
</h3>

## Easy to integrate

🤗 Accelerate was created for PyTorch users who like to write the training loop of PyTorch models but are reluctant to write and maintain the boilerplate code needed to use multi-GPUs/TPU/fp16.

🤗 Accelerate abstracts exactly and only the boilerplate code related to multi-GPUs/TPU/fp16 and leaves the rest of your code unchanged.

Here is an example:

```diff
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator

+ accelerator = Accelerator()
- device = 'cpu'
+ device = accelerator.device

  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())

  dataset = load_dataset('my_dataset')
  data = torch.utils.data.DataLoader(dataset, shuffle=True)

+ model, optimizer, data = accelerator.prepare(model, optimizer, data)

  model.train()
  for epoch in range(10):
      for source, targets in data:
          source = source.to(device)
          targets = targets.to(device)

          optimizer.zero_grad()

          output = model(source)
          loss = F.cross_entropy(output, targets)

-         loss.backward()
+         accelerator.backward(loss)

          optimizer.step()
```

As you can see in this example, by adding five lines to any standard PyTorch training script you can now run on any kind of single or distributed node setting (single CPU, single GPU, multi-GPU, and TPU) as well as with or without mixed precision (fp8, fp16, bf16).

In particular, the same code can then run without modification on your local machine for debugging or in your training environment.

🤗 Accelerate even handles the device placement for you (which requires a few more changes to your code, but is safer in general), so you can simplify your training loop further:

```diff
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator

- device = 'cpu'
+ accelerator = Accelerator()

- model = torch.nn.Transformer().to(device)
+ model = torch.nn.Transformer()
  optimizer = torch.optim.Adam(model.parameters())

  dataset = load_dataset('my_dataset')
  data = torch.utils.data.DataLoader(dataset, shuffle=True)

+ model, optimizer, data = accelerator.prepare(model, optimizer, data)

  model.train()
  for epoch in range(10):
      for source, targets in data:
-         source = source.to(device)
-         targets = targets.to(device)

          optimizer.zero_grad()

          output = model(source)
          loss = F.cross_entropy(output, targets)

-         loss.backward()
+         accelerator.backward(loss)

          optimizer.step()
```

Want to learn more? Check out the [documentation](https://huggingface.co/docs/accelerate) or have a look at our [examples](https://github.com/huggingface/accelerate/tree/main/examples).
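The same `Accelerator` object also centralizes mixed precision and gradient accumulation. Below is a minimal, self-contained sketch, not taken from the examples above; the tiny random dataset and linear model are placeholders, while `mixed_precision`, `gradient_accumulation_steps`, and the `accelerator.accumulate` context manager are real Accelerate APIs:

```python
import torch
import torch.nn.functional as F
from accelerate import Accelerator

# fp16 only makes sense on an accelerator device; fall back to full precision on CPU
accelerator = Accelerator(
    mixed_precision="fp16" if torch.cuda.is_available() else "no",
    gradient_accumulation_steps=4,
)

# Placeholder model and data, just to make the sketch runnable
model = torch.nn.Linear(128, 10)
optimizer = torch.optim.AdamW(model.parameters())
dataset = torch.utils.data.TensorDataset(torch.randn(64, 128), torch.randint(0, 10, (64,)))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for source, targets in dataloader:
    # Gradients are synchronized and the optimizer stepped only every fourth batch
    with accelerator.accumulate(model):
        output = model(source)
        loss = F.cross_entropy(output, targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```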
## Launching script

🤗 Accelerate also provides an optional CLI tool that allows you to quickly configure and test your training environment before launching the scripts. No need to remember how to use `torch.distributed.run` or to write a specific launcher for TPU training!
On your machine(s) just run:

```bash
accelerate config
```

and answer the questions asked. This will generate a config file that will be used automatically to properly set the default options when doing

```bash
accelerate launch my_script.py --args_to_my_script
```

For instance, here is how you would run the GLUE example on the MRPC task (from the root of the repo):

```bash
accelerate launch examples/nlp_example.py
```

This CLI tool is **optional**, and you can still use `python my_script.py` or `torchrun my_script.py` at your convenience.

You can also directly pass in the arguments you would give to `torchrun` as arguments to `accelerate launch` if you wish to not run `accelerate config`.

For example, here is how to launch on two GPUs:

```bash
accelerate launch --multi_gpu --num_processes 2 examples/nlp_example.py
```

To learn more, check the CLI documentation available [here](https://huggingface.co/docs/accelerate/package_reference/cli), or view the configuration zoo [here](https://github.com/huggingface/accelerate/blob/main/examples/config_yaml_templates/).
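For reference, the file written by `accelerate config` (stored by default under `~/.cache/huggingface/accelerate/default_config.yaml`) is a short YAML document. A representative two-GPU, fp16 sketch follows; the exact set of fields varies with the Accelerate version and the answers you give in the wizard:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: fp16
num_machines: 1
num_processes: 2
machine_rank: 0
main_training_function: main
```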
## Launching multi-CPU run using MPI

🤗 Here is another way to launch a multi-CPU run using MPI. You can learn how to install Open MPI on [this page](https://www.open-mpi.org/faq/?category=building#easy-build). You can use Intel MPI or MVAPICH as well.
Once you have MPI set up on your cluster, just run:

```bash
accelerate config
```

Answer the questions that are asked, selecting to run using multi-CPU, and answer "yes" when asked if you want accelerate to launch mpirun.
Then, use `accelerate launch` with your script like:

```bash
accelerate launch examples/nlp_example.py
```

Alternatively, you can use mpirun directly, without using the CLI, like:

```bash
mpirun -np 2 python examples/nlp_example.py
```

## Launching training using DeepSpeed

🤗 Accelerate supports training on single/multiple GPUs using DeepSpeed. To use it, you don't need to change anything in your training code; you can set everything using just `accelerate config`. However, if you want to tweak your DeepSpeed-related args from your Python script, we provide the `DeepSpeedPlugin`.

```python
from accelerate import Accelerator, DeepSpeedPlugin

# DeepSpeed needs to know your gradient accumulation steps beforehand, so don't forget to pass it
# Remember you still need to do gradient accumulation yourself, just like you would have done without DeepSpeed
deepspeed_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=2)
accelerator = Accelerator(mixed_precision='fp16', deepspeed_plugin=deepspeed_plugin)

# How to save your 🤗 Transformer?
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(save_dir, save_function=accelerator.save, state_dict=accelerator.get_state_dict(model))
```

Note: DeepSpeed support is experimental for now. In case you get into some problem, please open an issue.
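Beyond `save_pretrained`, Accelerate can also checkpoint the full training state in one call. A sketch, assuming an `accelerator` with a prepared model and optimizer as in the snippets above (the checkpoint path is a placeholder):

```python
# Save model weights, optimizer state, LR schedulers, and RNG states in one call.
# With DeepSpeed or FSDP enabled, the wrapped objects are handled for you.
accelerator.save_state("checkpoints/step_1000")

# ...later, to resume training from the same point:
accelerator.load_state("checkpoints/step_1000")
```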
## Launching your training from a notebook

🤗 Accelerate also provides a `notebook_launcher` function you can use in a notebook to launch a distributed training. This is especially useful for Colab or Kaggle notebooks with a TPU backend. Just define your training loop in a `training_function`, then, in your last cell, add:

```python
from accelerate import notebook_launcher

notebook_launcher(training_function)
```

An example can be found in [this notebook](https://github.com/huggingface/notebooks/blob/main/examples/accelerate_examples/simple_nlp_example.ipynb). [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/accelerate_examples/simple_nlp_example.ipynb)
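If the training function takes arguments, or you want to control how many worker processes are spawned, `notebook_launcher` accepts those too. A sketch; the hyperparameter values and the two-process count are illustrative:

```python
from accelerate import notebook_launcher

def training_function(learning_rate, num_epochs):
    # Build the Accelerator, model, and dataloaders here:
    # every spawned worker process executes this function.
    ...

# args are forwarded to training_function; num_processes=2 spawns two
# workers (e.g. two GPUs on the machine).
notebook_launcher(training_function, args=(1e-4, 3), num_processes=2)
```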
## Why should I use 🤗 Accelerate?

You should use 🤗 Accelerate when you want to easily run your training scripts in a distributed environment without having to renounce full control over your training loop. This is not a high-level framework above PyTorch, just a thin wrapper, so you don't have to learn a new library. In fact, the whole API of 🤗 Accelerate is in one class, the `Accelerator` object.

## Why shouldn't I use 🤗 Accelerate?

You shouldn't use 🤗 Accelerate if you don't want to write a training loop yourself. There are plenty of high-level libraries above PyTorch that will offer you that; 🤗 Accelerate is not one of them.

## Frameworks using 🤗 Accelerate

If you like the simplicity of 🤗 Accelerate but would prefer a higher-level abstraction around its capabilities, some frameworks and libraries that are built on top of 🤗 Accelerate are listed below:

* [Amphion](https://github.com/open-mmlab/Amphion) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
* [Animus](https://github.com/Scitator/animus) is a minimalistic framework to run machine learning experiments. Animus highlights common "breakpoints" in ML experiments and provides a unified interface for them within [IExperiment](https://github.com/Scitator/animus/blob/main/animus/core.py#L76).
* [Catalyst](https://github.com/catalyst-team/catalyst#getting-started) is a PyTorch framework for Deep Learning Research and Development. It focuses on reproducibility, rapid experimentation, and codebase reuse so you can create something new rather than write yet another train loop. Catalyst provides a [Runner](https://catalyst-team.github.io/catalyst/api/core.html#runner) to connect all parts of the experiment: hardware backend, data transformations, model training, and inference logic.
* [fastai](https://github.com/fastai/fastai#installing) is a PyTorch framework for Deep Learning that simplifies training fast and accurate neural nets using modern best practices. fastai provides a [Learner](https://docs.fast.ai/learner.html#Learner) to handle the training, fine-tuning, and inference of deep learning algorithms.
* [Finetuner](https://github.com/jina-ai/finetuner) is a service that enables models to create higher-quality embeddings for semantic search, visual similarity search, cross-modal text<->image search, recommendation systems, clustering, duplication detection, anomaly detection, or other uses.
* [InvokeAI](https://github.com/invoke-ai/InvokeAI) is a creative engine for Stable Diffusion models, offering an industry-leading WebUI and terminal usage support, and serving as the foundation for many commercial products.
* [Kornia](https://kornia.readthedocs.io/en/latest/get-started/introduction.html) is a differentiable library that allows classical computer vision to be integrated into deep learning models. Kornia provides a [Trainer](https://kornia.readthedocs.io/en/latest/x.html#kornia.x.Trainer) with the specific purpose of training and fine-tuning the supported deep learning algorithms within the library.
* [Open Assistant](https://projects.laion.ai/Open-Assistant/) is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
* [pytorch-accelerated](https://github.com/Chris-hughes10/pytorch-accelerated) is a lightweight training library, with a streamlined feature set centered around a general-purpose [Trainer](https://pytorch-accelerated.readthedocs.io/en/latest/trainer.html), that places a huge emphasis on simplicity and transparency, enabling users to understand exactly what is going on under the hood without having to write and maintain the boilerplate themselves!
* [Stable Diffusion web UI](https://github.com/AUTOMATIC1111/stable-diffusion-webui) is an open-source, browser-based, easy-to-use interface for Stable Diffusion built on the Gradio library.
* [torchkeras](https://github.com/lyhue1991/torchkeras) is a simple tool for training PyTorch models in a Keras style; a dynamic, good-looking plot is provided in the notebook to monitor your loss or metric.
* [transformers](https://github.com/huggingface/transformers) is a tool for training state-of-the-art machine learning models in PyTorch, TensorFlow, and JAX. (Accelerate is the backend for the PyTorch side.)

## Installation

This repository is tested on Python 3.8+ and PyTorch 1.10.0+.

You should install 🤗 Accelerate in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're unfamiliar with Python virtual environments, check out the [user guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).

First, create a virtual environment with the version of Python you're going to use and activate it.

Then, you will need to install PyTorch: refer to the [official installation page](https://pytorch.org/get-started/locally/#start-locally) regarding the specific install command for your platform. Then 🤗 Accelerate can be installed using pip as follows:

```bash
pip install accelerate
```
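Two built-in CLI commands are useful for sanity-checking the install (both ship with the `accelerate` package):

```bash
accelerate env    # print library versions, platform info, and the current default config
accelerate test   # run a short end-to-end training sanity check with that config
```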
## Supported integrations

- CPU only
- multi-CPU on one node (machine)
- multi-CPU on several nodes (machines)
- single GPU
- multi-GPU on one node (machine)
- multi-GPU on several nodes (machines)
- TPU
- FP16/BFloat16 mixed precision
- FP8 mixed precision with [Transformer Engine](https://github.com/NVIDIA/TransformerEngine) or [MS-AMP](https://github.com/Azure/MS-AMP/)
- DeepSpeed support (Experimental)
- PyTorch Fully Sharded Data Parallel (FSDP) support (Experimental)
- Megatron-LM support (Experimental)
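As with fp16 and bf16, fp8 is requested through the `mixed_precision` argument. A minimal sketch; it assumes H100-class hardware with Transformer Engine or MS-AMP installed, and `model`, `optimizer`, and `dataloader` already defined. Recipe-level knobs live in version-specific kwargs handlers, so check the docs for your release:

```python
from accelerate import Accelerator

# fp8 autocasting is applied to supported layers during prepare()
accelerator = Accelerator(mixed_precision="fp8")
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```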
## Citing 🤗 Accelerate

If you use 🤗 Accelerate in your publication, please cite it by using the following BibTeX entry.

```bibtex
@Misc{accelerate,
  title =        {Accelerate: Training and inference at scale made simple, efficient and adaptable.},
  author =       {Sylvain Gugger and Lysandre Debut and Thomas Wolf and Philipp Schmid and Zachary Mueller and Sourab Mangrulkar and Marc Sun and Benjamin Bossan},
  howpublished = {\url{https://github.com/huggingface/accelerate}},
  year =         {2022}
}
```
# 🤗 Accelerate Quickstart

🤗 **Accelerate** is a lightweight library from Hugging Face that lets you run native PyTorch training scripts on any hardware configuration (single CPU, single GPU, multi-GPU, TPU) with ease. It abstracts only the boilerplate needed for distributed training and mixed precision, so you keep full control of the training loop without learning a new framework.

## Environment setup

Before starting, make sure your development environment meets the following requirements:

*   **Operating system**: Linux, macOS, or Windows
*   **Python**: 3.8 or later
*   **Core dependencies**:
    *   PyTorch >= 1.10.0
    *   (Optional) Hugging Face ecosystem libraries such as Transformers and Datasets, if your project needs them
*   **Hardware support**:
    *   CPU
    *   NVIDIA GPU (with a matching CUDA install)
    *   Google TPU (on Colab/Kaggle or GCP)

> **Tip**: Users in mainland China may want to install PyTorch from the Tsinghua or USTC mirror for faster downloads.

## Installation

You can install the latest stable release directly with `pip`.

### Standard install
```bash
pip install accelerate
```

### Install via a Chinese mirror
Developers in China may prefer the Tsinghua mirror:
```bash
pip install accelerate -i https://pypi.tuna.tsinghua.edu.cn/simple
```

### Latest development version (optional)
If you want the newest features:
```bash
pip install git+https://github.com/huggingface/accelerate.git
```

## Basic usage

Only a handful of changes to an existing PyTorch training script (typically about five lines) are needed to support distributed training and mixed precision.

### 1. Initialize the Accelerator
Import and instantiate the `Accelerator` class; it automatically detects the current environment (CPU/GPU/TPU).

### 2. Prepare your objects
Wrap the model, optimizer, and dataloader with `accelerator.prepare()`.
### 3. Replace the backward pass
Replace `loss.backward()` with `accelerator.backward(loss)`.

### Code example

Here is the smallest set of changes needed to turn a native PyTorch script into an Accelerate script:

```python
import torch
import torch.nn.functional as F
from datasets import load_dataset
# 1. Import the Accelerator
from accelerate import Accelerator

# 2. Initialize the accelerator
accelerator = Accelerator()

# No need to set device = 'cuda' or 'cpu' by hand; accelerator.device handles it
model = torch.nn.Transformer()
optimizer = torch.optim.Adam(model.parameters())

dataset = load_dataset('my_dataset')
data = torch.utils.data.DataLoader(dataset, shuffle=True)

# 3. Prepare the model, optimizer, and data
model, optimizer, data = accelerator.prepare(model, optimizer, data)

model.train()
for epoch in range(10):
    for source, targets in data:
        # 4. Batches are moved to the right device automatically; no manual .to(device)
        # source = source.to(device)
        # targets = targets.to(device)

        optimizer.zero_grad()

        output = model(source)
        loss = F.cross_entropy(output, targets)

        # 5. Backpropagate through the accelerator
        accelerator.backward(loss)

        optimizer.step()
```

### Launching training

Once the code is ready, the CLI tool launches training without hand-written `torch.distributed.run` commands.

**Step 1: configure the environment**
Run the following command and answer the prompts about your hardware (number of GPUs, mixed precision, and so on):
```bash
accelerate config
```

**Step 2: launch the script**
Run your script with `accelerate launch`:
```bash
accelerate launch my_script.py --args_to_my_script
```

If you would rather skip the configuration wizard, pass the arguments on the command line directly (for example, to run on 2 GPUs):
```bash
accelerate launch --multi_gpu --num_processes 2 my_script.py
```

For local debugging you can still run `python my_script.py` directly; the code needs no changes.

## Use case

An algorithm team at a startup is trying to reproduce and fine-tune a large language model for a vertical customer-service domain on a limited compute budget.

### Without accelerate
- **Painful environment adaptation**: the code hard-codes single-GPU logic; switching to multi-GPU distributed training or TPUs means rewriting a lot of device-management and process-launching code.
- **Tedious mixed-precision setup**: hand-written AMP (automatic mixed precision) logic is error-prone and hard to switch between fp16, bf16, and the newer fp8 format, leading to poor memory utilization or unstable training.
- **Debugging and deployment split apart**: the single-CPU/single-GPU scripts used for local debugging cannot run directly on the production cluster, and maintaining two codebases adds risk and time cost.
- **High barrier to large-model training**: using DeepSpeed or FSDP requires a deep understanding of complex parallelism configuration that most developers lack.

### With accelerate
- **Write once, run anywhere**: inserting about five lines of accelerate code into an existing PyTorch script is enough to switch seamlessly between single-GPU, multi-GPU, TPU, and different mixed-precision modes without touching the core logic.
- **Smart resource management**: accelerate handles device placement and the backward pass, and natively supports advanced precisions such as fp8, cutting memory use and speeding up training.
- **A unified workflow**: the same code can be debugged on a local laptop and submitted unchanged to a large cluster for full-scale training, removing the maintenance burden of environment differences.
- **Large-model training for everyone**: DeepSpeed or FSDP can be enabled through a simple config file, so the team can train very large models without becoming distributed-systems experts.

accelerate frees developers from low-level infrastructure code, letting PyTorch training truly "focus on the model, not the runtime environment".

## Project information

- **Owner**: [Hugging Face](https://github.com/huggingface) (huggingface), "The AI community building the future."
- **License**: Apache-2.0
- **Languages**: Python 99.6%, Dockerfile 0.2%, Makefile 0.2%, Shell 0%
- **Stars / forks**: 9,609 / 1,335
- **Operating system**: not specified
- **GPU**: not required. Single CPU, single GPU, multi-GPU, and TPU environments are supported. For GPU use, a PyTorch-compatible NVIDIA GPU is needed (the model and memory depend on your model size); mixed-precision training (fp8, fp16, bf16) is supported.
- **Notes**: the tool is a lightweight wrapper that runs native PyTorch training scripts across hardware configurations (CPU/GPU/TPU/multi-GPU) with minimal code changes. The environment can be configured through the `accelerate config` CLI; DeepSpeed integration (experimental) and MPI multi-CPU launching are supported. For notebooks (Colab/Kaggle) a dedicated `notebook_launcher` function is provided.
- **Dependencies**: torch (PyTorch)
- **Category**: development framework
函数。",[48,49],"torch","PyTorch",[51],"开发框架","ready","2026-03-27T02:49:30.150509","2026-04-18T14:14:12.730583",[56,61,66,71,76,81],{"id":57,"question_zh":58,"answer_zh":59,"source_url":60},39756,"使用多 GPU 训练时遇到 NCCL 超时错误（Watchdog caught collective operation timeout）怎么办？","该错误通常发生在单节点多 GPU 分布式训练（DDP）过程中，表现为运行几个步骤后随机出现 NCCL 超时。这通常是由于通信阻塞或异步 CUDA 内核导致的数据不一致。虽然具体原因可能涉及网络或硬件，但建议检查以下几点：\n1. 确保所有 GPU 之间的网络连接正常且带宽充足。\n2. 尝试增加 `timeout` 参数值（默认通常为 1800000ms）。\n3. 确认使用的是 accelerate 封装的分布式实现而非原生 torch.distributed，以利用其内置的容错机制。\n4. 如果是扩散模型等特定任务，检查是否有特定的同步操作导致阻塞。","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fissues\u002F314",{"id":62,"question_zh":63,"answer_zh":64,"source_url":65},39757,"遇到 ValueError: FlatParameter requires uniform dtype but got torch.float16 and torch.float32 错误如何解决？","此错误常见于使用 PEFT 库结合 Accelerate 进行混合精度训练时。解决方案是修改 PEFT 模型的初始化方式：\n将原来的：\n`transformer = get_peft_model(transformer, transformer_lora_config)`\n改为：\n`transformer.add_adapter(transformer_lora_config)`\n这种改变可以避免数据类型不一致导致的扁平化参数错误，特别适用于基于 diffusers 的扩散模型训练。","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fissues\u002F1620",{"id":67,"question_zh":68,"answer_zh":69,"source_url":70},39758,"如何将 Accelerate 与 Hugging Face Trainer 结合使用？需要修改代码吗？","不需要修改任何代码。只需执行以下步骤：\n1. 运行 `accelerate config` 命令。\n2. 在配置向导中选择 FSDP 或 DeepSpeed 作为分布式后端。\n3. 生成新的配置文件。\n4. 使用 `accelerate launch your_script.py` 启动训练即可。\nAccelerate 会自动处理分布式逻辑，原有的 Hugging Face Trainer 代码无需变动。","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fissues\u002F144",{"id":72,"question_zh":73,"answer_zh":74,"source_url":75},39759,"如何在 TPU Pod 上使用 Accelerate 进行训练？","要在 TPU Pod 上使用 Accelerate，请按以下步骤操作：\n1. 安装最新版的 Accelerate：`pip install git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate`\n2. 确保每个节点都安装了该版本。\n3. 关键步骤：要么通过 git 安装 `torch_xla`，要么在主节点运行以下命令修复缺失模块：\n   `wget https:\u002F\u002Fraw.githubusercontent.com\u002Fpytorch\u002Fxla\u002Fmaster\u002Ftorch_xla\u002Fdistributed\u002Fxla_dist.py -O \u002Fusr\u002Flocal\u002Flib\u002Fpython3.8\u002Fdist-packages\u002Ftorch_xla\u002Fdistributed\u002Fxla_dist.py`\n4. 在主节点运行 `accelerate config` 并进行相应配置。\n5. 如果系统需要 sudo 权限安装，请在配置提示时将相关选项设为 `True`。\n完成上述配置后，即可使用 `accelerate launch` 启动 TPU Pod 训练任务。","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fissues\u002F501",{"id":77,"question_zh":78,"answer_zh":79,"source_url":80},39760,"FSDP 相比 DeepSpeed 是否需要更大的学习率才能以相似速度收敛？","是的，根据用户反馈和实验结果，在使用 FSDP（Fully Sharded Data Parallel）时，通常需要比 DeepSpeed ZeRO-3 更大的学习率才能达到相似的收敛速度。这是因为 FSDP 的分片策略和梯度聚合方式与 DeepSpeed 不同，可能导致优化动态有所差异。建议在切换到 FSDP 时：\n1. 适当提高初始学习率（例如增加 1.5 到 2 倍）。\n2. 使用学习率预热（warmup）策略来稳定训练初期。\n3. 
**Q: Does FSDP need a larger learning rate than DeepSpeed to converge at a similar speed?**

Yes. Based on user reports and experiments, FSDP (Fully Sharded Data Parallel) typically needs a larger learning rate than DeepSpeed ZeRO-3 to reach a similar convergence speed, because FSDP's sharding strategy and gradient aggregation differ from DeepSpeed's and can change the optimization dynamics. When switching to FSDP:
1. Raise the initial learning rate somewhat (for example, by 1.5x to 2x).
2. Use learning-rate warmup to stabilize early training.
3. Monitor the loss curve and adjust the LR scheduler as needed.
The best learning rate still has to be tuned for your model size and dataset.

Source: https://github.com/huggingface/accelerate/issues/2624

**Q: What about "Cannot copy out of meta tensor" errors with an FSDP configuration?**

This error usually appears when `cpu_ram_efficient_loading: True` is enabled, because data is being copied directly out of a meta tensor. The fix: do not use `torch.nn.Module.to()` to move a module off the meta device; use
`torch.nn.Module.to_empty(device='cuda')`
and then load the state dict. Also make sure your transformers and accelerate versions are compatible (e.g. transformers>=4.56.2, accelerate>=1.10.1), and check that `fsdp_cpu_ram_efficient_loading` and the related loading logic are set correctly in the FSDP config.

Source: https://github.com/huggingface/accelerate/issues/1620
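The timeout increase mentioned in the first FAQ is passed to Accelerate through a kwargs handler. A minimal sketch; the two-hour value is illustrative, while `InitProcessGroupKwargs` and `kwargs_handlers` are real Accelerate APIs:

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the collective-op timeout from the 1800 s default to 2 hours
kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=7200))
accelerator = Accelerator(kwargs_handlers=[kwargs])
```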
## Release notes

### v1.13.0 (2026-03-04)

#### AWS Neuron support
We now support AWS Neuron (Trainium/Inferentia) devices. Thanks to @michaelbenayoun.
- Neuron integration, by @michaelbenayoun in https://github.com/huggingface/accelerate/pull/3935

#### XPU improvements
We removed the IPEX dependency and improved the device-agnostic code for XPU.
- Use spawn instead of fork for XPU devices, by @kaixuanliu in https://github.com/huggingface/accelerate/pull/3884
- Remove IPEX, by @yao-matrix in https://github.com/huggingface/accelerate/pull/3883
- Make the new XPU-related code device-agnostic, by @yao-matrix in https://github.com/huggingface/accelerate/pull/3890
- Fix KMP_AFFINITY being set incorrectly for non-CPU training, by @hexfaker in https://github.com/huggingface/accelerate/pull/3912

#### FSDP2 improvements
A batch of important fixes for FSDP2 users: upcast only parameters that require gradients, a better tied-embeddings error, a DCP optimizer-loading bug, a bf16 optimizer-step crash, and compatibility with torch < 2.7.0.
- Only upcast FSDP2 parameters that require gradients, by @ojh31 in https://github.com/huggingface/accelerate/pull/3848
- Fix the FSDP2 tied-embeddings error with a targeted ValueError message, by @amanzoni1 in https://github.com/huggingface/accelerate/pull/3878
- Fix a bug where FSDP2 could not load optimizer state with DCP, by @flymin in https://github.com/huggingface/accelerate/pull/3904
- Fix optimizer.step crashing with FSDP2 enabled and a bfloat16 model, by @sywangyi in https://github.com/huggingface/accelerate/pull/3905
- Fix FSDP2 crashing with ignored_params on torch < 2.7.0, by @Mr-Neutr0n in https://github.com/huggingface/accelerate/pull/3924

#### DeepSpeed sequence parallelism
Several fixes for the DeepSpeed + sequence-parallelism integration introduced in v1.12.0, including evaluation support during SP training and correct process-group handling.
- [SP] Fix the loss-computation example, by @kashif in https://github.com/huggingface/accelerate/pull/3858
- [SP and CP] Raise an error if CP and SP are enabled at the same time, by @kashif in https://github.com/huggingface/accelerate/pull/3862
- DeepSpeed gets its own process group, by @kashif in https://github.com/huggingface/accelerate/pull/3916
- [DeepSpeed] Skip device-mesh creation when DeepSpeed is enabled and sp_size > 1, by @kashif in https://github.com/huggingface/accelerate/pull/3914
- Allow evaluation in DeepSpeed sequence-parallel mode, by @jp1924 in https://github.com/huggingface/accelerate/pull/3917

#### FP8
We improved FP8 training. Thanks to @shimizust for fixing torchao support.
- Fix padding in the default FP8 torchao config and support FSDP2 all-gather, by @shimizust in https://github.com/huggingface/accelerate/pull/3831
- Fix execution with Transformer Engine, by @ksivaman in https://github.com/huggingface/accelerate/pull/3852
- Add an MS-AMP deprecation warning, by @neha222222 in https://github.com/huggingface/accelerate/pull/3857

#### Performance
Accelerate now imports faster by lazy-loading heavy dependencies.

### v1.12.0 (2025-11-21)

#### DeepSpeed Ulysses/ALST integration

DeepSpeed Ulysses/ALST is a method for training on long sequences efficiently using sequence parallelism and attention-head parallelism. You can learn more about the technique from the paper https://arxiv.org/abs/2506.13996 or the DeepSpeed tutorial https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/.

<img width="2368" height="1250" alt="0d8bd9e0" src="https://github.com/user-attachments/assets/b94e90c9-4368-4711-ad57-58de3c714ebc" />

To enable DeepSpeed Ulysses, first create a `ParallelismConfig` and set the `sp`-related parameters:

```python
parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=2,
    sp_handler=DeepSpeedSequenceParallelConfig(...),
)
```

Then make sure you compute the loss correctly, as described in our [docs](https://huggingface.co/docs/accelerate/main/en/concept_guides/sequence_parallelism):

```python
        ...
        losses_per_rank = torch.distributed.nn.functional.all_gather(loss, group=sp_group)
        good_tokens = (shift_labels != -100).view(-1).sum()
        good_tokens_per_rank = torch.distributed.nn.functional.all_gather(good_tokens, group=sp_group)
        total_loss = sum(
            losses_per_rank[rank] * good_tokens_per_rank[rank]
            for rank in range(sp_world_size)
            if good_tokens_per_rank[rank] > 0
        )
        total_good_tokens = sum(good_tokens_per_rank)
        loss = total_loss / max(total_good_tokens, 1)
```

Thanks to @S1ro1 for starting this work and to @stas00 for finishing it. Thanks also to @kashif for adding docs and for reviewing and testing the PR!

The feature will also be available in the HF Trainer through @stas00's PR: https://github.com/huggingface/transformers/pull/41832

#### Minor changes

* Remove the warning about `cpu_ram_efficient_loading`, by @SunMarc in https://github.com/huggingface/accelerate/pull/3816
* Fix a typo in the bnb 4-bit quantization flag docstring, by @hbraith in https://github.com/huggingface/accelerate/pull/3828
* Replace ArXiv with HF Papers, by @qgallouedec in https://github.com/huggingface/accelerate/pull/3834
* Fix a typo in the broadcast_object_list docstring, by @wsntxxn in https://github.com/huggingface/accelerate/pull/3823
* [Bug] Update torch.optim.Optimizer parameter state after tensor parallelism, by @naomili0924 in https://github.com/huggingface/accelerate/pull/3835
* Use self-hosted runners, by @SunMarc in https://github.com/huggingface/accelerate/pull/3841
* Device-type helper functions, by @kashif in https://github.com/huggingface/accelerate/pull/3843

#### New contributors
* @hbraith made their first contribution in https://github.com/huggingface/accelerate/pull/3828
* @wsntxxn made their first contribution in https://github.com/huggingface/accelerate/pull/3823
* @naomili0924 made their first contribution in https://github.com/huggingface/accelerate/pull/3835

**Full changelog**: https://github.com/huggingf
### v1.11.0 (2025-10-20)

#### TE MXFP8 support

We added MXFP8 support to the TransformerEngine integration. To use it, set `use_mxfp8_block_scaling` in your `fp8_config`. See the NVIDIA docs [here](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#MXFP8-and-block-scaling) for details.

* @pstjohn added support for the TE MXFP8 recipe to Accelerate in https://github.com/huggingface/accelerate/pull/3688

#### FP16/BF16 training on MPS devices

BF16 and FP16 support for MPS devices has finally landed. You can now pass `mixed_precision = "fp16"` or `"bf16"` when training on a Mac (`fp16` needs PyTorch 2.8, `bf16` needs PyTorch 2.6).
* @SunMarc added bf16/fp16 MPS support for AMP in https://github.com/huggingface/accelerate/pull/3373

#### FSDP updates

The following PRs add `ignored_params` and `no_sync()` support for FSDPv2, respectively:
* feat: support ignored_params for fsdp2, by @kmehant in https://github.com/huggingface/accelerate/pull/3731
* fix: call `model.set_requires_gradient_sync(False)` to turn off gradient synchronization in FSDP2, by @EquationWalker in https://github.com/huggingface/accelerate/pull/3762

Mixed precision can now be passed as a dtype string via the Accelerate CLI flags or the `fsdp_config` section of the config file:
* feat: allow the mixed-precision policy as a dtype, by @kmehant in https://github.com/huggingface/accelerate/pull/3751

#### Nd-parallelism updates

A few small updates to nd-parallelism.

* @sergiopaniego fixed a typo in the context-parallelism docs in https://github.com/huggingface/accelerate/pull/3761
* feat: add a to_json method, by @S1ro1 in https://github.com/huggingface/accelerate/pull/3743
* Make the torch_native_parallelism examples device-agnostic, by @yao-matrix in https://github.com/huggingface/accelerate/pull/3759
* [ND Parallel] Update examples and clean up, by @S1ro1 in https://github.com/huggingface/accelerate/pull/3737

#### Move to Python 3.10

We dropped support for Python 3.9, which reached end of life in October.
* Move to Python 3.10 and update the linter, by @SunMarc in https://github.com/huggingface/accelerate/pull/3809

#### A pile of small fixes
* fix: CPU-RAM-efficient loading for nd or HSDP parallelisms, by @kmehant in https://github.com/huggingface/accelerate/pull/3740
* Work around the XPU INT64 all_gather issue, fixed in 2.9, by @yao-matrix in https://github.com/huggingface/accelerate/pull/3756
* Pass device_ids to torch.distributed.barrier in PartialState, by @qgallouedec in https://github.com/huggingface/accelerate/pull/3744
* fix: specify the device for process_tensor in the example usage, by @qgallouedec in https://github.com/huggingface/accelerate/pull/3755
* Reduce the complexity of get_balanced_memory by adding a set, by @SamuelBarryCS in https://github.com/huggingface/accelerate/pull/3776
* fix: skip the CUDA cache flush when the original device is `cpu` and the module is offloaded to `meta`, by @Qubitium in https://github.com/huggingface/accelerate/pull/3796
* Fix LayerNorm conversion without bias.

### v1.10.1 (2025-08-25)

- feat: add a to_json method, by @S1ro1 in https://github.com/huggingface/accelerate/pull/3743
- Guard the device_mesh import, by @SunMarc in https://github.com/huggingface/accelerate/pull/3742

**Full changelog**: https://github.com/huggingface/accelerate/compare/v1.10.0...v1.10.1
### v1.10.0 (2025-08-07)

#### N-d parallelism

Training large models across multiple GPUs can be complex, especially when combining [different parallelism strategies](https://huggingface.co/spaces/nanotron/ultrascale-playbook) (e.g. TP, CP, DP). To simplify this, we partnered with [Axolotl](https://github.com/axolotl-ai-cloud/axolotl/) on an easy-to-use integration that lets you apply any combination of parallelism strategies directly in your training script. Just pass a `ParallelismConfig` specifying the size of each parallelism type, and that's it. For more on how it works, see our latest [blog post](https://github.com/huggingface/blog/pull/3006).

```python
parallelism_config = ParallelismConfig(
    dp_shard_size=2,
    dp_replicate_size=2,
    cp_size=2,
    tp_size=2,
)
accelerator = Accelerator(
    parallelism_config=parallelism_config,
   ...
)
model = AutoModelForCausalLM.from_pretrained("your-model-name", device_mesh=accelerator.torch_device_mesh)
model = accelerator.prepare(model)
```

* Parallelism config + TP + HSDP + BYODM (Bring Your Own Device Mesh), by @SalmanMohammadi in https://github.com/huggingface/accelerate/pull/3682
* feat: context parallelism 2.0, by @S1ro1 in https://github.com/huggingface/accelerate/pull/3700
* Set a default submesh_tp_size to prevent an unset-local-variable error, by @winglian in https://github.com/huggingface/accelerate/pull/3687
* Add parallelism property getters to the Accelerator class, by @WoosungMyung in https://github.com/huggingface/accelerate/pull/3703
* fix: make prepare work even when only TP is specified (rare as that is), by @S1ro1 in https://github.com/huggingface/accelerate/pull/3707
* Set parallelism_config in the constructor, since the Trainer resets State, by @winglian in https://github.com/huggingface/accelerate/pull/3713
* fix: tp size not being read from environment variables, by @S1ro1 in https://github.com/huggingface/accelerate/pull/3716
* Remove `ParallelismConfig` from `PartialState`, by @SunMarc in https://github.com/huggingface/accelerate/pull/3720

#### FSDP improvements

We fixed attribute handling for ignored modules. It is now possible to train PEFT models with MoE layers containing `q_proj` and `v_proj` parameters, which matters in particular for fine-tuning `gpt-oss` models.

* enhancement: allow regex for FSDP ignored modules, by @BenjaminBossan in https://github.com/huggingface/accelerate/pull/3698
* tests: add a test for FSDP ignored modules given as a string, by @BenjaminBossan in https://github.com/huggingface/accelerate/pull/3719

#### Small improvements
* feat: CPU-offload pre_forward no longer tries to move tensors already on the device, by @JoeGaffney in https://github.com/huggingface/accelerate/pull/3695
* fix: make environment-variable values in Accelerate case-insensitive, by @jp1924 in https://github.com/huggingface/accelerate/pull/3712
* Remove use_ipex, by @SunMarc in https://github.com/huggingface/accelerate/pull/3721

#### New contributors
* @SalmanMohammadi made their first contribution in
### v1.9.0 (2025-07-16)

#### Trackio tracker support
We added support for trackio, a lightweight, 💯 free experiment-tracking Python library built on top of 🤗 Datasets and Spaces.

![Screen recording 2025-06-11 5:39 PM](https://github.com/user-attachments/assets/5cf12286-54e7-4119-8a20-88c2cbd37ab6)

Key features:
- *Local-first* design: the dashboard runs locally by default. You can also host it on Spaces by specifying a `space_id`.
- Logs are persisted locally (or to a private Hugging Face dataset).
- Experiments can be visualized with a Gradio dashboard, locally or on Hugging Face Spaces.
- Everything here, including hosting on Hugging Face, is **free**!

To use it with accelerate, set `log_with` and initialize the tracker:
```python
accelerator = Accelerator(log_with="trackio")
config={"learning_rate": 0.001, "batch_size": 32}
# init_kwargs is used to host the dashboard on Spaces
init_kwargs = {"trackio": {"space_id": "hf_username/space_name"}}
accelerator.init_trackers("example_project", config=config, init_kwargs=init_kwargs)
```
Thanks to @pcuenca for the integration!
* trackio, by @pcuenca in https://github.com/huggingface/accelerate/pull/3669

#### Faster model loading with `set_module_tensor_to_device`
Setting tensors while clearing the cache is very slow, so we added a `clear_device` option to disable that behavior. Another small optimization is to use `non_blocking` everywhere and synchronize once before handing control back to the user. This makes loading slightly faster.
* Make model loading in Diffusers 4-5x faster ⚡, by @a-r-r-o-w in https://github.com/huggingface/accelerate/pull/3674

#### Small improvements for FSDP, DeepSpeed, FP8

* Add support for e5e2 and default to hybrid mode when using the launcher, by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3640
* Fix the FP8 tests and allow FP8 without configuring `Accelerator()` directly, by @pstjohn in https://github.com/huggingface/accelerate/pull/3677
* A series of FSDP improvements, by @S1ro1 in https://github.com/huggingface/accelerate/pull/3671
* fix: raise a proper error when using DDP with a DTensor model, by @S1ro1 in https://github.com/huggingface/accelerate/pull/3629
* Fix a typo in the fsdp2 examples, by @shimizust in https://github.com/huggingface/accelerate/pull/3657
* Add a check in no_sync() to avoid errors with DeepSpeed ZeRO-2/3, by @xliu0105 in https://github.com/huggingface/accelerate/pull/3656

#### 🚨🚨🚨 Breaking change 🚨🚨🚨
`find_executable_batch_size()` no longer halves the batch size after every OOM. Instead, the batch size is multiplied by 0.9. This helps users avoid wasting GPU resources (a usage sketch follows this entry).

* "Stop halving my batch size!" · Change the default backoff factor from 0.5 to 0.9, by @SunMarc in https://github.com/huggingface/accelerate/pull/3684

#### What's changed

* [typo] from "shard" to "shards", by @SunMarc in https://github.com/huggingface/accelerate/pull/3645
* docs: fix a typo in the gradient-accumulation guide, by @kilavvy in https://github.com/huggingface/accelerate/pull/3649
* xpu enablement
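For reference, a minimal sketch of how the utility behind this breaking change is used; the starting batch size and the empty loop body are placeholders:

```python
from accelerate.utils import find_executable_batch_size

# On a CUDA OOM, the decorated function is retried with a smaller batch size
# (scaled by the 0.9 backoff factor described above, instead of the old 0.5).
@find_executable_batch_size(starting_batch_size=128)
def train(batch_size):
    # Build dataloaders and the model with `batch_size` and run the loop here
    ...

train()  # batch_size is injected by the decorator, not passed by the caller
```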
### v1.8.1 (2025-06-20)

- Add support for e5e2 and default to hybrid mode when using the launcher, by @IlyasMoutawwakil: https://github.com/huggingface/accelerate/pull/3640
- Shards, by @SunMarc: https://github.com/huggingface/accelerate/pull/3645

**Full changelog**: https://github.com/huggingface/accelerate/compare/v1.8.0...v1.8.1

### v1.8.0 (2025-06-19)

#### FSDPv2 refactor + FP8 support

We simplified how FSDPv2 models are prepared, as composing FSDP2 with other features (such as FP8, torch.compile, and activation checkpointing) had become too complex. The setup is more restrictive now, but it reduces errors and is more efficient. We also added FP8 support. You can view the results [here](https://github.com/huggingface/accelerate/tree/main/examples/fsdp2). Thanks @S1ro1 for the contribution!

* [FSDP2] Refactor + FP8, by @S1ro1 in https://github.com/huggingface/accelerate/pull/3585

#### Faster distributed training on Intel CPUs

We updated the `CCL_WORKER_COUNT` variable and added `KMP` parameters for Intel CPU users. This significantly improves distributed training performance (e.g. tensor parallelism), up to 40% faster when training Transformer TP models on 4th-gen Intel Xeon processors.

* Set ccl and KMP parameters in simple launch, by @jiqing-feng in https://github.com/huggingface/accelerate/pull/3575

#### Regional compilation support for DeepSpeed

We added regional-compilation support for the DeepSpeed engine. DeepSpeed's `.compile()` modifies the model in place using `torch.nn.Module.compile(...)`, rather than the out-of-place `torch.compile(...)`, so we had to adapt for that. Thanks @IlyasMoutawwakil for the feature!

* Fix DeepSpeed regional compilation, by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3609

#### ipex.optimize deprecation
`ipex.optimize` is being deprecated. Most of the optimizations have been merged into PyTorch mainline, and future improvements will land in PyTorch directly. For users who have not yet moved to PyTorch 2.8, we will keep relying on IPEX for now.

* Remove ipex.optimize from Accelerate, by @yao-matrix in https://github.com/huggingface/accelerate/pull/3608

#### Better XPU support
We substantially extended and stabilized support for Intel XPU:

* Enable FSDP2 benchmarks on XPU, by @yao-matrix in https://github.com/huggingface/accelerate/pull/3590
* Enable big-model inference on XPU, by @yao-matrix in https://github.com/huggingface/accelerate/pull/3595
* Enable the `test_load_checkpoint_and_dispatch_with_broadcast` test case on XPU, by @yao-matrix
* Enable CLI and example test cases on XPU, by @yao-matrix in https://github.com/huggingface/accelerate/pull/3578
* Enable TorchAO and Pippy test cases on XPU, by @yao-matrix in https://github.com/huggingface/accelerate/pull/3599
* Enable regional-compilation benchmarks on XPU, by @yao-matrix in https://github.com/huggingface/accelerate/pull/3592
* Fix 8-bit value loading on XPU, by @jiqing-feng in https://github.com/huggingface/accelerate/pull/3623
* Add a device-agnostic GradScaler, by @yao-matrix in https://github.com/huggingface/accelerate/pull/3588
* Add XPU support in TorchTensorParallelPlugin, by @yao-matrix in https://github.com/huggingface/accelerate/pull/3627

#### Trackers

We added support for [SwanLab](https://github.com/SwanHubX/SwanLab) as an experiment-tracking backend. Many thanks!
\n\n# ipex.optimize deprecated\n`ipex.optimize` is being deprecated. Most of its optimizations have been merged into PyTorch mainline, and future improvements will land directly in PyTorch. For users who have not yet upgraded to PyTorch 2.8, we will keep relying on IPEX for now.\n\n* Remove ipex.optimize in Accelerate, by @yao-matrix in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3608.\n\n# Better XPU support\nWe substantially expanded and stabilized support for Intel XPU:\n\n* Enable the FSDP2 benchmark on XPU, by @yao-matrix in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3590.\n* Enable big model inference on XPU, by @yao-matrix in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3595.\n* Enable the `test_load_checkpoint_and_dispatch_with_broadcast` test case on XPU, by @yao-matrix.\n* Enable the CLI and example test cases on XPU, by @yao-matrix in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3578.\n* Enable the TorchAO and Pippy test cases on XPU, by @yao-matrix in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3599.\n* Enable the regional compilation benchmark on XPU, by @yao-matrix in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3592.\n* Fix 8-bit value loading on XPU, by @jiqing-feng in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3623.\n* Add a device-agnostic GradScaler, by @yao-matrix in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3588.\n* Add XPU support in TorchTensorParallelPlugin, by @yao-matrix in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3627.\n\n# Trackers\n\nWe added support for [SwanLab](https:\u002F\u002Fgithub.com\u002FSwanHubX\u002FSwanLab) as an experiment tracking backend. Many thanks!","2025-06-19T15:37:06",{"id":127,"version":128,"summary_zh":129,"released_at":130},315696,"v1.7.0","# Regional compilation\n\nInstead of compiling the entire model at once, regional compilation first targets the repeated blocks (such as decoder layers). The compiler can then cache and reuse the optimized code for subsequent blocks, which significantly cuts the cold-start compilation time typically seen on the first inference. Thanks @IlyasMoutawwakil for the feature! You can view the full benchmark results here: [link](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Ftree\u002Fmain\u002Fbenchmarks\u002Ftorch.compile), and check our updated [compilation guide](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Faccelerate\u002Fen\u002Fusage_guides\u002Fcompilation) for more details!\n\n![compilation_time-1](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F38795d12-6ee7-4a10-84c6-d29a0877e36c)\n\nTo enable this feature, set `use_regional_compilation=True` in the `TorchDynamoPlugin` config.\n\n```python\n# Configure the compilation backend\ndynamo_plugin = TorchDynamoPlugin(\n    use_regional_compilation=True,\n    ...  # other parameters\n)\n# Initialize the accelerator with the plugin\naccelerator = Accelerator(dynamo_plugin=dynamo_plugin)\n# This will apply regional compilation to your model\nmodel = accelerator.prepare(model)\n```\n\n# Layerwise casting hook\n\nWe introduced a new hook that performs layer-by-layer upcasting and downcasting (e.g., for linear layers) during inference. This lets users run models with separate storage and compute dtypes, which saves memory. The idea was first implemented in [diffusers](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdiffusers\u002Fmain\u002Fen\u002Foptimization\u002Fmemory#layerwise-casting), where downcasting models to FP8 proved effective without significantly degrading quality. Contributed by @sayakpaul in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3427.\n\n```python\nimport torch\nfrom accelerate.hooks import attach_layerwise_casting_hooks  # imports added; the hook's module path is our assumption\n\nmodel = ...  # your torch.nn.Module\nstorage_dtype = torch.float8_e4m3fn\ncompute_dtype = torch.bfloat16\nattach_layerwise_casting_hooks(\n    model,\n    storage_dtype=storage_dtype,\n    compute_dtype=compute_dtype,\n)\n```\n\n# Better FSDP2 support\n\nThis release includes several new features and bug fixes. Notably, we added support for the widely used `FULL_STATE_DICT` option from FSDP, which now lets the `.save_pretrained()` method from the transformers library work with FSDP2-wrapped models. QLoRA training is now supported as well, though it still needs further testing. We also fixed a backend issue related to offloading parameters to CPU, and resolved a memory spike that occurred when `cpu_ram_efficient_loading=True` is enabled. A few other small improvements and fixes are included as well; see the “What's Changed” section for full details.\n\n- `FULL_STATE_DICT`, by @S1ro1 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3527.\n- QLoRA support added by @winglian in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3546.\n- Correctly setting the backend in the CUDA+FSDP2+CPU-offload case was resolved in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3574.\n- The memory spike with `cpu_ram_efficient_loading=True` was fixed by @S1ro1 in h","2025-05-15T12:33:48",{"id":132,"version":133,"summary_zh":134,"released_at":135},315697,"v1.6.0","# FSDPv2 support  \nThis release adds support for [FSDPv2](https:\u002F\u002Fpytorch.org\u002Fdocs\u002Fstable\u002Fdistributed.fsdp.fully_shard.html), thanks to @S1ro1.  \n\nIf you are using Python code, set `fsdp_version=2` in `FullyShardedDataParallelPlugin`:  \n```python\nfrom accelerate import FullyShardedDataParallelPlugin, Accelerator\n\nfsdp_plugin = FullyShardedDataParallelPlugin(\n    fsdp_version=2,\n    # other options...\n)\naccelerator = Accelerator(fsdp_plugin=fsdp_plugin)\n```  \n\nTo convert a YAML config file containing an FSDPv1 config to the FSDPv2 format, use our conversion tool:  \n```  \naccelerate to-fsdp2 --config_file config.yaml --output_file new_config.yaml  \n```  \n\nTo learn more about the differences between FSDPv1 and FSDPv2, read the [documentation](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Faccelerate\u002Fmain\u002Fen\u002Fconcept_guides\u002Ffsdp1_vs_fsdp2).  \n\n# DeepSpeed TP support  \n\nWe added initial support for DeepSpeed + TP. Not much had to change, since DeepSpeed's API is already compatible; we only had to make sure the dataloader is TP-compatible and that TP weights can be saved. Thanks @inkcherry for the work! https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3390.  \n\nTo use TP with DeepSpeed, update your DeepSpeed config file by adding a `tensor_parallel` key (a fuller sketch follows right after the XCCL note below):  \n```  \n    ....\n    \"tensor_parallel\": {\n      \"autotp_size\": ${autotp_size}\n    },\n   ...\n```  \nSee DeepSpeed's [PR](https:\u002F\u002Fgithub.com\u002Fdeepspeedai\u002FDeepSpeed\u002Fpull\u002F6922) for more details.  \n\n# Support for the XCCL distributed backend  \n\nWe added support for XCCL, a distributed backend developed by Intel that can be used with XPU devices. See PyTorch's [PR](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fissues\u002F141741) for more details. Thanks @dvrogozh for the [integration](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3401)!
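\n\nAs promised above, a fuller sketch of where the `tensor_parallel` block sits in a complete config. The surrounding keys, the ZeRO stage, and the concrete `autotp_size` are illustrative assumptions, not values from these notes; `DeepSpeedPlugin(hf_ds_config=...)` also accepts the config as a Python dict:\n\n```python\nfrom accelerate import Accelerator\nfrom accelerate.utils import DeepSpeedPlugin\n\nds_config = {\n    \"train_micro_batch_size_per_gpu\": \"auto\",  # assumed\n    \"zero_optimization\": {\"stage\": 1},  # assumed ZeRO stage\n    \"tensor_parallel\": {\"autotp_size\": 4},  # 4-way TP, illustrative\n}\n\naccelerator = Accelerator(deepspeed_plugin=DeepSpeedPlugin(hf_ds_config=ds_config))\n```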
\n\n## What's Changed  \n* Add `log_artifact`, `log_artifacts`, and `log_figure` to MLflowTracker, by @luiz0992 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3419.  \n* Implement a tensor-parallel dataloader for the DeepSpeed accelerator, by @inkcherry in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3390.  \n* Fix production issues, by @muellerzr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3441.  \n* Fix an attribute issue in DeepSpeed TP, by @SunMarc in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3443.  \n* Fix a typo in the multi-node FSDP Slurm example script, by @JacobB33 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3447.  \n* New feature: add `no_ssh` and Slurm multi-node launcher options for DeepSpeed, by @hsmallbone in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3329.  \n* Fix an issue with the ao module filter function, by @muellerzr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3450.  \n* Remove the device-index workaround on XPU, since XPU now supports integer device indices like CUDA, by @yao-matrix in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3448.  \n* Enable 2 unit test cases on XPU, by @yao-matrix in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3445.  \n* Fi","2025-04-01T13:48:21",{"id":137,"version":138,"summary_zh":139,"released_at":140},315698,"v1.5.2","**Bug Fixes**:\r\n* Fixed an issue with `torch.get_default_device()` requiring a higher version than what we support\r\n* Fixed a broken `pytest` import in prod\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fcompare\u002Fv1.5.0...v1.5.2","2025-03-14T14:16:16",{"id":142,"version":143,"summary_zh":144,"released_at":145},315699,"v1.5.0","## HPU Support\r\n* Adds in HPU accelerator support for 🤗 Accelerate\r\n\r\n## What's Changed\r\n* [bug] fix device index bug for model training loaded with bitsandbytes by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3408\r\n* [docs] add the missing `import torch` by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3396\r\n* minor doc fixes by @nbroad1881 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3365\r\n* fix: ensure CLI args take precedence over config file. 
by @cyr0930 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3409\r\n* fix: Add `device=torch.get_default_device()` in `torch.Generator`s by @saforem2 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3420\r\n* Add Tecorigin SDAA accelerator support by @siqi654321 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3330\r\n* fix typo : thier -> their by @hackty in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3423\r\n* Fix quality by @muellerzr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3424\r\n* Distributed inference example for llava_next by @VladOS95-cyber in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3417\r\n* HPU support by @IlyasMoutawwakil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3378\r\n\r\n## New Contributors\r\n* @cyr0930 made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3409\r\n* @saforem2 made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3420\r\n* @siqi654321 made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3330\r\n* @hackty made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3423\r\n* @VladOS95-cyber made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3417\r\n* @IlyasMoutawwakil made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3378\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fcompare\u002Fv1.4.0...v1.5.0","2025-03-12T14:18:54",{"id":147,"version":148,"summary_zh":149,"released_at":150},315700,"v1.4.0","## `torchao` FP8, initial Tensor Parallel support, and memory leak fixes\r\n\r\n### `torchao` FP8\r\n\r\nThis release introduces a new FP8 API and brings in a new backend: [`torchao`](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao\u002Ftree\u002Fmain\u002Ftorchao\u002Ffloat8). To use, pass in `AORecipeKwargs` to the `Accelerator` while setting `mixed_precision=\"fp8\"`. This is initial support; as it matures, we will incorporate more into it (such as `accelerate config`\u002Fyaml) in future releases. See our benchmark examples [here](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Ftree\u002Fmain\u002Fbenchmarks\u002Ffp8\u002Ftorchao).
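\r\n\r\nA minimal sketch of the new API described above; treat the default-constructed `AORecipeKwargs()` and the placeholder model as illustrative assumptions rather than a prescribed recipe:\r\n\r\n```python\r\nimport torch\r\nfrom accelerate import Accelerator\r\nfrom accelerate.utils import AORecipeKwargs\r\n\r\n# Opt into the torchao float8 backend via the new kwargs handler\r\naccelerator = Accelerator(\r\n    mixed_precision=\"fp8\",\r\n    kwargs_handlers=[AORecipeKwargs()],  # default recipe; tune once the API matures\r\n)\r\n\r\nmodel = torch.nn.Linear(64, 64)  # placeholder model\r\noptimizer = torch.optim.AdamW(model.parameters())\r\nmodel, optimizer = accelerator.prepare(model, optimizer)\r\n```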
\r\n\r\n## TensorParallel\r\n\r\nWe have initial support for an in-house solution to TP when working with accelerate dataloaders; check out the PR [here](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3173)\r\n\r\n## Bug fixes\r\n* fix triton version check by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3345\r\n* fix torch_dtype in estimate memory by @SunMarc in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3383\r\n* works for fp8 with deepspeed by @XiaobingSuper in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3361\r\n* [`memory leak`] Replace GradientState -> DataLoader reference with weakrefs by @tomaarsen in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3391\r\n\r\n## What's Changed\r\n* fix triton version check by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3345\r\n* [tests] enable BNB test cases in `tests\u002Ftest_quantization.py` on XPU by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3349\r\n* [Dev] Update release directions by @muellerzr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3352\r\n* [tests] make cuda-only test work on other hardware accelerators by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3302\r\n* [tests] remove `require_non_xpu` test markers by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3301\r\n* Support more functionalities for MUSA backend by @fmo-mt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3359\r\n* [tests] enable more bnb tests on XPU by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3350\r\n* feat: support tensor parallel & Data loader by @kmehant in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3173\r\n* DeepSpeed github repo move sync by @stas00 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3376\r\n* [tests] Fix bnb cpu error by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3351\r\n* fix torch_dtype in estimate memory by @SunMarc in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3383\r\n* works for fp8 with deepspeed by @XiaobingSuper in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3361\r\n* fix: typos in documentation files by @maximevtush in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3388\r\n* [examples] upgrade code for seed setting by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3387\r\n* [`memory leak`] Replace GradientState -> DataLoader reference with weakrefs by @tomaarsen in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3391\r\n* add xpu check in `get_quantized_model_device_map` by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3397\r\n* Torchao float8 training by @muellerzr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3348\r\n\r\n## New Contributors\r\n* @kmehant made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3173\r\n* @XiaobingSuper made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3361\r\n* @maximevtush made their first contribution in 
https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3388\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fcompare\u002Fv1.3.0...v1.4.0","2025-02-17T17:18:10",{"id":152,"version":153,"summary_zh":154,"released_at":155},315701,"v1.3.0","## Torch 2.0\r\nAs it's been ~2 years since torch 2.0 was first released, we are now requiring this as the **minimum version for Accelerate**, which similarly was done in `transformers` as of its last release.\r\n\r\n## Core\r\n* [docs] no hard-coding cuda by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3270\r\n* fix load_state_dict for npu by @ji-huazhong in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3211\r\n* Add `keep_torch_compile` param to `unwrap_model` and `extract_model_from_parallel` for distributed compiled model. by @ggoggam in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3282\r\n* [tests] make cuda-only test case device-agnostic by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3340\r\n* latest bnb no longer has optim_args attribute on optimizer by @winglian in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3311\r\n* add torchdata version check to avoid \"in_order\" error by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3344\r\n* [docs] fix typo, change \"backoff_filter\" to \"backoff_factor\" by @suchot in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3296\r\n* dataloader: check that in_order is in kwargs before trying to drop it by @dvrogozh in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3346\r\n* feat(tpu): remove nprocs from xla.spawn by @tengomucho in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3324\r\n\r\n## Big Modeling\r\n* Fix test_nested_hook by @SunMarc in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3289\r\n* correct the return statement of _init_infer_auto_device_map by @Nech-C in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3279\r\n* Use torch.xpu.mem_get_info for XPU by @dvrogozh in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3275\r\n* Ensure that tied parameter is children of module by @pablomlago in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3327\r\n* Fix for offloading when using TorchAO >= 0.7.0 by @a-r-r-o-w in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3332\r\n* Fix offload generate tests by @SunMarc in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3334\r\n\r\n## Examples\r\n* Give example on how to handle gradient accumulation with cross-entropy by @ylacombe in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3193\r\n\r\n## Full Changelog\r\n## What's Changed\r\n* [docs] no hard-coding cuda by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3270\r\n* fix load_state_dict for npu by @ji-huazhong in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3211\r\n* Fix test_nested_hook by @SunMarc in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3289\r\n* correct the return statement of _init_infer_auto_device_map by @Nech-C in 
https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3279\r\n* Give example on how to handle gradient accumulation with cross-entropy by @ylacombe in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3193\r\n* Use torch.xpu.mem_get_info for XPU by @dvrogozh in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3275\r\n* Add `keep_torch_compile` param to `unwrap_model` and `extract_model_from_parallel` for distributed compiled model. by @ggoggam in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3282\r\n* Ensure that tied parameter is children of module by @pablomlago in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3327\r\n* Bye bye torch \u003C2 by @muellerzr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3331\r\n* Fixup docker build err by @muellerzr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3333\r\n* feat(tpu): remove nprocs from xla.spawn by @tengomucho in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3324\r\n* Fix offload generate tests by @SunMarc in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3334\r\n* [tests] make cuda-only test case device-agnostic by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3340\r\n* latest bnb no longer has optim_args attribute on optimizer by @winglian in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3311\r\n* Fix for offloading when using TorchAO >= 0.7.0 by @a-r-r-o-w in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3332\r\n* add torchdata version check to avoid \"in_order\" error by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3344\r\n* [docs] fix typo, change \"backoff_filter\" to \"backoff_factor\" by @suchot in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3296\r\n* dataloader: check that in_order is in kwargs before trying to drop it by @dvrogozh in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3346\r\n\r\n## New Contributors\r\n* @ylacombe made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3193\r\n* @ggoggam made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3282\r\n* @pablomlago made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3327\r\n* @tengomucho made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3324\r\n* @suchot made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3296\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fcompare\u002Fv1.2.1...v1.3.0","2025-01-17T15:56:18",{"id":157,"version":158,"summary_zh":159,"released_at":160},315702,"v1.2.1","* fix: add max_memory to _init_infer_auto_device_map's return statement in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3279 by @Nech-C \r\n* fix load_state_dict for npu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3211 by @statelesshz \r\n\r\n**Full Changelog**: 
https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fcompare\u002Fv1.2.0...v1.2.1","2024-12-13T18:56:09",{"id":162,"version":163,"summary_zh":164,"released_at":165},315703,"v1.2.0","## Core\r\n* enable `find_executable_batch_size` on XPU by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3236\r\n* Use `numpy._core` instead of `numpy.core` by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3247\r\n* Add warnings and fallback for unassigned devices in infer_auto_device_map by @Nech-C in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3066\r\n* Allow for full dynamo config passed to Accelerator by @muellerzr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3251\r\n* [WIP] FEAT Decorator to purge accelerate env vars by @BenjaminBossan in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3252\r\n* [`data_loader`] Optionally also propagate set_epoch to batch sampler by @tomaarsen in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3246\r\n* use XPU instead of GPU in the `accelerate config` prompt text by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3268\r\n\r\n## Big Modeling\r\n* Fix `align_module_device`, ensure only cpu tensors for `get_state_dict_offloaded_model` by @kylesayrs in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3217\r\n* Remove hook for bnb 4-bit  by @SunMarc in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3223\r\n* [docs] add instruction to install bnb on non-cuda devices by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3227\r\n* Take care of case when \"_tied_weights_keys\" is not an attribute by @fabianlim in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3226\r\n* Update deferring_execution.md by @max-yue in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3262\r\n* Revert default behavior of `get_state_dict_from_offload` by @kylesayrs in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3253\r\n* Fix: Resolve #3060, `preload_module_classes` is lost for nested modules by @wejoncy in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3248\r\n\r\n## DeepSpeed\r\n* Select the DeepSpeedCPUOptimizer based on the original optimizer class. 
by @eljandoubi in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3255\r\n* support for wrapped schedulefree optimizer when using deepspeed by @winglian in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3266\r\n\r\n## Documentation\r\n\r\n* Update code in tracking documentation  by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3235\r\n* Replaced set\u002Fcheck breakpoint with set\u002Fcheck trigger in the troubleshooting documentation by @relh in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3259\r\n\r\n* Update set-seed by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3228\r\n* Fix typo by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3221\r\n* Use real path for `checkpoint` by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3220\r\n* Fixed multiple typos for Tutorials and Guides docs by @henryhmko in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3274\r\n\r\n## New Contributors\r\n* @winglian made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3266\r\n* @max-yue made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3262\r\n* @as12138 made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3261\r\n* @relh made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3259\r\n* @wejoncy made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3248\r\n* @henryhmko made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3274\r\n\r\n\r\n## Full Changelog\r\n* Fix `align_module_device`, ensure only cpu tensors for `get_state_dict_offloaded_model` by @kylesayrs in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3217\r\n* remove hook for bnb 4-bit  by @SunMarc in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3223\r\n* enable `find_executable_batch_size` on XPU by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3236\r\n* take care of case when \"_tied_weights_keys\" is not an attribute by @fabianlim in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3226\r\n* [docs] update code in tracking documentation  by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3235\r\n* Add warnings and fallback for unassigned devices in infer_auto_device_map by @Nech-C in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3066\r\n* [`data_loader`] Optionally also propagate set_epoch to batch sampler by @tomaarsen in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3246\r\n* [docs] add instruction to install bnb on non-cuda devices by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3227\r\n* Use `numpy._core` instead of `numpy.core` by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3247\r\n* Allow for full dynamo config passed to Accelerator by @muellerzr in 
https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3251\r\n* [WIP] FEAT Decorator to purge accelerate env vars by @BenjaminBossan in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3252\r\n* use XPU instead of GPU in the `accelerate config` prompt text by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3268\r\n* support for wrapped schedulefree optimizer when using deeps","2024-12-13T18:47:06",{"id":167,"version":168,"summary_zh":169,"released_at":170},315704,"v1.1.0","## Internals:\r\n* Allow for a `data_seed` argument in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3150\r\n* Trigger `weights_only=True` by default for all compatible objects when checkpointing and saving with `torch.save` in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3036\r\n* Handle negative values for `dim` input in `pad_across_processes` in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3114\r\n* Enable cpu bnb distributed lora finetune in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3159\r\n\r\n## DeepSpeed\r\n* Support torch dynamo for deepspeed>=0.14.4 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3069\r\n\r\n## Megatron\r\n* update Megatron-LM plugin code to version 0.8.0 or higher in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3174\r\n\r\n## Big Model Inference\r\n* New `has_offloaded_params` utility added in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3188\r\n\r\n## Examples\r\n* Florence2 distributed inference example in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3123\r\n\r\n## Full Changelog\r\n* Handle negative values for `dim` input in `pad_across_processes` by @mariusarvinte in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3114\r\n* Fixup DS issue with weakref by @muellerzr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3143\r\n* Refactor scaler to util by @muellerzr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3142\r\n* DS fix, continued by @muellerzr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3145\r\n* Florence2 distributed inference example by @hlky in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3123\r\n* POC: Allow for a `data_seed` by @muellerzr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3150\r\n* Adding multi gpu speech generation by @dame-cell in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3149\r\n* support torch dynamo for deepspeed>=0.14.4 by @oraluben in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3069\r\n* Fixup Zero3 + `save_model` by @muellerzr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3146\r\n* Trigger `weights_only=True` by default for all compatible objects by @muellerzr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3036\r\n* Remove broken dynamo test by @oraluben in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3155\r\n* fix version check bug in `get_xpu_available_memory` by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3165\r\n* 
enable cpu bnb distributed lora finetune by @jiqing-feng in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3159\r\n* [Utils] `has_offloaded_params` by @kylesayrs in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3188\r\n* fix bnb by @eljandoubi in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3186\r\n* [docs] update neptune API by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3181\r\n* docs: fix a wrong word in comment in src\u002Faccelerate\u002Faccelerate.py:1255 by @Rebornix-zero in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3183\r\n* [docs] use nn.module instead of tensor as model  by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3157\r\n* Fix typo by @kylesayrs in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3191\r\n* MLU devices : Checks if mlu is available via an cndev-based check which won't trigger the drivers and leave mlu by @huismiling in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3187\r\n* update Megatron-LM plugin code to version 0.8.0 or higher. by @eljandoubi in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3174\r\n* 🚨 🚨 🚨 Goodbye Python 3.8! 🚨 🚨 🚨  by @muellerzr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3194\r\n* Update transformers.deepspeed references from transformers 4.46.0 release by @loadams in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3196\r\n* eliminate dead code by @statelesshz in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3198\r\n* take `torch.nn.Module` model into account when moving to device   by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3167\r\n* [docs] add xpu part and fix bug in `torchrun`  by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3166\r\n* Models With Tied Weights Need Re-Tieing After FSDP Param Init by @fabianlim in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3154\r\n* add the missing xpu for local sgd by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3163\r\n* typo fix in big_modeling.py by @a-r-r-o-w in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3207\r\n* [Utils] `align_module_device` by @kylesayrs in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3204\r\n\r\n## New Contributors\r\n* @mariusarvinte made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3114\r\n* @hlky made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3123\r\n* @dame-cell made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3149\r\n* @kylesayrs made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3188\r\n* @eljandoubi made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3186\r\n* @Rebornix-zero made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggi","2024-11-01T15:30:17",{"id":172,"version":173,"summary_zh":174,"released_at":175},315705,"v1.0.1","## 
Bugfixes\r\n\r\n* Fixes an issue where the `auto` values were no longer being parsed when using [deepspeed](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3143)\r\n* Fixes a broken test in the deepspeed tests related to the [auto values](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3145)\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fcompare\u002Fv1.0.0...v1.0.1","2024-10-12T03:01:13",{"id":177,"version":178,"summary_zh":179,"released_at":180},315706,"v1.0.0","## 🚀 Accelerate 1.0 🚀 \r\n\r\nWith `accelerate` 1.0, we are officially stating that the core parts of the API are now \"stable\" and ready for the future of what the world of distributed training and PyTorch has to handle. With these release notes, we will focus first on the major breaking changes to get your code fixed, followed by what is new specifically between 0.34.0 and 1.0.\r\n\r\nTo read more, check out our official blog [here](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Faccelerate-v1)\r\n\r\n## Migration assistance\r\n\r\n* Passing in `dispatch_batches`, `split_batches`, `even_batches`, and `use_seedable_sampler` to the `Accelerator()` should now be handled by creating an `accelerate.utils.DataLoaderConfiguration()` and passing this to the `Accelerator()` instead (`Accelerator(dataloader_config=DataLoaderConfiguration(...))`)\r\n* `Accelerator().use_fp16` and `AcceleratorState().use_fp16` have been removed; this should be replaced by checking `accelerator.mixed_precision == \"fp16\"`\r\n* `Accelerator().autocast()` no longer accepts a `cache_enabled` argument. Instead, an `AutocastKwargs()` instance should be used which handles this flag (among others), passing it to the `Accelerator` (`Accelerator(kwargs_handlers=[AutocastKwargs(cache_enabled=True)])`)\r\n* `accelerate.utils.is_tpu_available` should be replaced with `accelerate.utils.is_torch_xla_available`\r\n* `accelerate.utils.modeling.shard_checkpoint` should be replaced with `split_torch_state_dict_into_shards` from the `huggingface_hub` library\r\n* `accelerate.tqdm.tqdm()` no longer accepts `True`\u002F`False` as the first argument, and instead, `main_process_only` should be passed in as a named argument\r\n\r\n## Multiple Model DeepSpeed Support\r\n\r\nAfter many requests, we finally have multiple model DeepSpeed support in Accelerate (though it is quite early still). Read the full tutorial [here](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Faccelerate\u002Fv1.0.0\u002Fen\u002Fusage_guides\u002Fdeepspeed_multiple_model#using-multiple-models-with-deepspeed); essentially:\r\n\r\nWhen using multiple models, a DeepSpeed plugin should be created for each model (and as a result, a separate config). 
A few examples are below:\r\n\r\n### Knowledge distillation\r\n\r\n(Here we train only one model, the student with the zero2 config, while the other, the zero3 teacher, is used for inference)\r\n\r\n```python\r\nfrom accelerate import Accelerator\r\nfrom accelerate.utils import DeepSpeedPlugin\r\n\r\nzero2_plugin = DeepSpeedPlugin(hf_ds_config=\"zero2_config.json\")\r\nzero3_plugin = DeepSpeedPlugin(hf_ds_config=\"zero3_config.json\")\r\n\r\ndeepspeed_plugins = {\"student\": zero2_plugin, \"teacher\": zero3_plugin}\r\n\r\naccelerator = Accelerator(deepspeed_plugins=deepspeed_plugins)\r\n```\r\n\r\nTo select which plugin should be used at a given time (i.e., when calling `prepare`), we call `accelerator.state.select_deepspeed_plugin(\"name\")`, where the first plugin is active by default:\r\n\r\n```python\r\naccelerator.state.select_deepspeed_plugin(\"student\")\r\nstudent_model, optimizer, scheduler = ...\r\nstudent_model, optimizer, scheduler, train_dataloader = accelerator.prepare(student_model, optimizer, scheduler, train_dataloader)\r\n\r\naccelerator.state.select_deepspeed_plugin(\"teacher\")  # This will automatically enable zero init\r\nteacher_model = AutoModel.from_pretrained(...)\r\nteacher_model = accelerator.prepare(teacher_model)\r\n```\r\n\r\n### Multiple disjoint models\r\n\r\nFor disjoint models, separate accelerators should be used for each model, and their own `.backward()` should be called separately:\r\n\r\n```python\r\nfor batch in dl:\r\n    outputs1 = first_model(**batch)\r\n    first_accelerator.backward(outputs1.loss)\r\n    first_optimizer.step()\r\n    first_scheduler.step()\r\n    first_optimizer.zero_grad()\r\n\r\n    outputs2 = model2(**batch)\r\n    second_accelerator.backward(outputs2.loss)\r\n    second_optimizer.step()\r\n    second_scheduler.step()\r\n    second_optimizer.zero_grad()\r\n```\r\n\r\n## FP8\r\n\r\nWe've enabled MS-AMP support up to FSDP. At this time we are not going forward with implementing FSDP support with MS-AMP, due to design issues between both libraries that don't make them inter-op easily.
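\r\n\r\nFor concreteness, here is a sketch of opting into the MS-AMP backend mentioned above under DDP (FSDP inter-op is explicitly not supported). `FP8RecipeKwargs` and its `backend`\u002F`opt_level` arguments reflect the documented API of this era, so verify them against your installed version:\r\n\r\n```python\r\nfrom accelerate import Accelerator\r\nfrom accelerate.utils import FP8RecipeKwargs\r\n\r\n# MS-AMP backend with its O2 optimization level (assumed, commonly used values)\r\nfp8_kwargs = FP8RecipeKwargs(backend=\"msamp\", opt_level=\"O2\")\r\naccelerator = Accelerator(mixed_precision=\"fp8\", kwargs_handlers=[fp8_kwargs])\r\n```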
\r\n\r\n## FSDP\r\n* Fixed FSDP auto_wrap using characters instead of full str for layers\r\n* Re-enable setting state dict type manually\r\n\r\n## Big Modeling\r\n* Removed cpu restriction for bnb training\r\n\r\n## What's Changed\r\n* Fix FSDP auto_wrap using characters instead of full str for layers by @muellerzr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3075\r\n* Allow DataLoaderAdapter subclasses to be pickled by implementing `__reduce__` by @byi8220 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3074\r\n* Fix three typos in src\u002Faccelerate\u002Fdata_loader.py by @xiabingquan in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3082\r\n* Re-enable setting state dict type by @muellerzr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3084\r\n* Support sequential cpu offloading with torchao quantized tensors by @a-r-r-o-w in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3085\r\n* fix bug in `_get_named_modules` by @faaany in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fpull\u002F3052\r\n* ","2024-10-07T15:42:18",{"id":182,"version":183,"summary_zh":184,"released_at":185},315707,"v0.34.1","## Bug fixes\r\n* Fixes an issue where processed `DataLoaders` could no longer be pickled in #3074 thanks to @byi8220 \r\n* Fixes an issue when using FSDP where `default_transformers_cls_names_to_wrap` would separate `_no_split_modules` by characters instead of keeping it as a list of layer names in #3075 \r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\u002Fcompare\u002Fv0.34.0...v0.34.1","2024-09-05T15:36:16",[187,199,207,216,224,233],{"id":188,"name":189,"github_repo":190,"description_zh":191,"stars":192,"difficulty_score":193,"last_commit_at":194,"category_tags":195,"status":52},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[196,51,197,198],"Agent","图像","数据工具",{"id":200,"name":201,"github_repo":202,"description_zh":203,"stars":204,"difficulty_score":193,"last_commit_at":205,"category_tags":206,"status":52},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[51,197,196],{"id":208,"name":209,"github_repo":210,"description_zh":211,"stars":212,"difficulty_score":42,"last_commit_at":213,"category_tags":214,"status":52},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 
and the like). It is not just a set of config files but a complete framework honed through long real-world use, designed to address the core pain points AI agents face in day-to-day development: low efficiency, lost memory, security risks, and a lack of continuous learning.\n\nBy introducing modular skills, intuition enhancement, persistent-memory mechanisms, and built-in security scanning, everything-claude-code markedly improves how AI performs on complex tasks and helps developers build more stable, smarter production-grade AI agents. Its distinctive “research-first” development philosophy and token-consumption optimizations make model responses faster and cheaper while effectively defending against potential attack vectors.\n\nThe toolkit is a great fit for software developers, AI researchers, and technical teams that want to deeply customize their AI workflows. Whether you are building on a large codebase or need AI assistance with security audits and automated testing, everything-claude-code provides strong low-level support. An open-source project that won an Anthropic hackathon award, it combines multi-language support with a rich set of battle-tested hooks, letting the AI truly grow into one that understands",159636,"2026-04-17T23:33:34",[51,196,215],"Language Model",{"id":217,"name":218,"github_repo":219,"description_zh":220,"stars":221,"difficulty_score":42,"last_commit_at":222,"category_tags":223,"status":52},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI is a powerful, highly modular visual AI engine built for designing and executing complex Stable Diffusion image-generation pipelines. It drops the traditional code-writing model in favor of an intuitive node-graph interface, letting users build personalized generation pipelines simply by connecting functional modules.\n\nThis design neatly solves the complexity and inflexibility of configuring advanced AI image workflows. Users with no programming background can freely combine models, tune parameters, and preview results in real time, easily handling everything from basic text-to-image up to complex multi-step high-resolution-fix tasks. ComfyUI is broadly compatible: it supports Windows, macOS, and Linux, runs across NVIDIA, AMD, Intel, and Apple Silicon hardware, and was quick to support cutting-edge models such as SDXL, Flux, and SD3.\n\nWhether you are a researcher or developer probing the potential of these algorithms, or a designer or seasoned AI-art enthusiast chasing maximum creative freedom, ComfyUI delivers. Its modular architecture lets the community keep extending it with new capabilities, making it one of the most flexible open-source diffusion-model tools with the richest ecosystem, helping users turn ideas into reality efficiently.",108322,"2026-04-10T11:39:34",[51,197,196],{"id":225,"name":226,"github_repo":227,"description_zh":228,"stars":229,"difficulty_score":42,"last_commit_at":230,"category_tags":231,"status":52},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli is an open-source AI command-line tool from Google that brings the power of the Gemini models directly into your terminal. For developers who are used to working on the command line, it offers the shortest path from typing a prompt to getting a model response, with no window switching required.\n\nThe tool removes the pain of constant context switching during development, letting users handle code understanding, generation, debugging, and automated ops tasks right inside the familiar terminal. Whether you are querying a large codebase, generating an app from a sketch, or performing complex Git operations, gemini-cli handles it efficiently through natural-language instructions.\n\nIt is especially suitable for software engineers, DevOps staff, and technical researchers. Core highlights include a context window of up to 1 million tokens and excellent reasoning ability; built-in utilities such as Google Search, file operations, and shell command execution; and, uniquely, support for MCP (Model Context Protocol), which lets users flexibly extend custom integrations and connect external capabilities such as image generation. On top of that, a personal Google account comes with a free usage quota, and the project is fully open source under Apache 2.0, making it an ideal assistant for boosting terminal productivity.",100752,"2026-04-10T01:20:03",[232,196,197,51],"Plugin",{"id":234,"name":235,"github_repo":236,"description_zh":237,"stars":238,"difficulty_score":42,"last_commit_at":239,"category_tags":240,"status":52},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown is a lightweight Python tool from Microsoft's AutoGen team, designed to convert files of all kinds to Markdown efficiently. It supports parsing PDF, Word, Excel, PowerPoint, images (with OCR), audio (with speech transcription), HTML, and even YouTube links, accurately extracting key structural information such as headings, lists, tables, and links.\n\nAs AI applications spread, large language models (LLMs) are good at handling text but struggle to read complex binary office documents directly. MarkItDown fixes exactly that pain point: it turns unstructured or semi-structured files into Markdown that models understand “natively” and that is highly token-efficient, making it an ideal bridge between local files and AI analysis pipelines. It also provides an MCP (Model Context Protocol) server that integrates seamlessly into LLM applications such as Claude Desktop.\n\nThe tool is particularly suitable for developers, data scientists, and AI researchers, especially those building retrieval-augmented generation (RAG) systems, doing bulk text analysis, or wanting an AI assistant to “read” local files directly. The output is reasonably human-readable too, but its core strength is giving machines",93400,"2026-04-06T19:52:38",[232,51]]