[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-baofff--U-ViT":3,"tool-baofff--U-ViT":62},[4,18,26,35,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,2,"2026-04-10T11:39:34",[14,15,13],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":32,"last_commit_at":41,"category_tags":42,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[43,13,15,14],"插件",{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":10,"last_commit_at":50,"category_tags":51,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 
API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[52,15,13,14],"语言模型",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4292,"Deep-Live-Cam","hacksider\u002FDeep-Live-Cam","Deep-Live-Cam 是一款专注于实时换脸与视频生成的开源工具，用户仅需一张静态照片，即可通过“一键操作”实现摄像头画面的即时变脸或制作深度伪造视频。它有效解决了传统换脸技术流程繁琐、对硬件配置要求极高以及难以实时预览的痛点，让高质量的数字内容创作变得触手可及。\n\n这款工具不仅适合开发者和技术研究人员探索算法边界，更因其极简的操作逻辑（仅需三步：选脸、选摄像头、启动），广泛适用于普通用户、内容创作者、设计师及直播主播。无论是为了动画角色定制、服装展示模特替换，还是制作趣味短视频和直播互动，Deep-Live-Cam 都能提供流畅的支持。\n\n其核心技术亮点在于强大的实时处理能力，支持口型遮罩（Mouth Mask）以保留使用者原始的嘴部动作，确保表情自然精准；同时具备“人脸映射”功能，可同时对画面中的多个主体应用不同面孔。此外，项目内置了严格的内容安全过滤机制，自动拦截涉及裸露、暴力等不当素材，并倡导用户在获得授权及明确标注的前提下合规使用，体现了技术发展与伦理责任的平衡。",88924,"2026-04-06T03:28:53",[14,15,13,61],"视频",{"id":63,"github_repo":64,"name":65,"description_en":66,"description_zh":67,"ai_summary_zh":68,"readme_en":69,"readme_zh":70,"quickstart_zh":71,"use_case_zh":72,"hero_image_url":73,"owner_login":74,"owner_name":75,"owner_avatar_url":76,"owner_bio":77,"owner_company":78,"owner_location":78,"owner_email":78,"owner_twitter":78,"owner_website":79,"owner_url":80,"languages":81,"stars":90,"forks":91,"last_commit_at":92,"license":93,"difficulty_score":10,"env_os":94,"env_gpu":95,"env_ram":96,"env_deps":97,"category_tags":111,"github_topics":78,"view_count":32,"oss_zip_url":78,"oss_zip_packed_at":78,"status":17,"created_at":113,"updated_at":114,"faqs":115,"releases":156},6356,"baofff\u002FU-ViT","U-ViT","A PyTorch implementation of the paper \"All are Worth Words: A ViT Backbone for Diffusion Models\".","U-ViT 是一个基于 PyTorch 实现的开源项目，旨在为扩散模型提供一种全新的视觉变换器（ViT）骨干网络架构。传统扩散模型多依赖基于卷积神经网络（CNN）的 U-Net 结构，而 U-ViT 创新地将时间步、条件信息和噪声图像块等所有输入统一视为\"Token\"进行处理，并引入了浅层与深层之间的“长跳跃连接”机制。\n\n这一设计有效解决了传统架构在生成高质量图像时的局限性。研究表明，长跳跃连接对提升模型性能和收敛速度至关重要，而 CNN U-Net 中常见的下采样和上采样操作并非总是必要。凭借这一架构，U-ViT 在无条件、类条件及文本到图像生成任务中表现卓越，甚至在 ImageNet 和 MS-COCO 
数据集上刷新了多项纪录，其生成质量可媲美或超越同等规模的 CNN 模型。\n\nU-ViT 特别适合人工智能研究人员、深度学习开发者以及对生成式模型感兴趣的技术爱好者使用。代码库不仅提供了优化的注意力计算实现和多个预训练模型，还集成了混合精度训练、梯度检查点等高效技术，使得在有限算力（如仅用两张 A100 显卡）下训练高分辨率大模型成为可能。无论是用于学术探索还是构建多模态应用，U-","U-ViT 是一个基于 PyTorch 实现的开源项目，旨在为扩散模型提供一种全新的视觉变换器（ViT）骨干网络架构。传统扩散模型多依赖基于卷积神经网络（CNN）的 U-Net 结构，而 U-ViT 创新地将时间步、条件信息和噪声图像块等所有输入统一视为\"Token\"进行处理，并引入了浅层与深层之间的“长跳跃连接”机制。\n\n这一设计有效解决了传统架构在生成高质量图像时的局限性。研究表明，长跳跃连接对提升模型性能和收敛速度至关重要，而 CNN U-Net 中常见的下采样和上采样操作并非总是必要。凭借这一架构，U-ViT 在无条件、类条件及文本到图像生成任务中表现卓越，甚至在 ImageNet 和 MS-COCO 数据集上刷新了多项纪录，其生成质量可媲美或超越同等规模的 CNN 模型。\n\nU-ViT 特别适合人工智能研究人员、深度学习开发者以及对生成式模型感兴趣的技术爱好者使用。代码库不仅提供了优化的注意力计算实现和多个预训练模型，还集成了混合精度训练、梯度检查点等高效技术，使得在有限算力（如仅用两张 A100 显卡）下训练高分辨率大模型成为可能。无论是用于学术探索还是构建多模态应用，U-ViT 都为扩散模型的骨干网络研究提供了极具价值的参考与实践工具。","## U-ViT\u003Cbr> \u003Csub>\u003Csmall>Official PyTorch implementation of [All are Worth Words: A ViT Backbone for Diffusion Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.12152) (CVPR 2023)\u003C\u002Fsmall>\u003C\u002Fsub>\n\n\n💡Projects with U-ViT: \n* [UniDiffuser](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002Funidiffuser), a multi-modal large-scale diffusion model based on a 1B U-ViT, is open-sourced\n* [DPT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.10586), [code](https:\u002F\u002Fgithub.com\u002FML-GSAI\u002FDPT), [demo](https:\u002F\u002Fml-gsai.github.io\u002FDPT-demo) a conditional diffusion model trained with 1 label\u002Fclass with SOTA SSL generation and classification results on ImageNet\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbaofff_U-ViT_readme_c73e62e3ba1e.png\" alt=\"drawing\" width=\"400\"\u002F>\n\nVision transformers (ViT) have shown promise in various vision tasks while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. \nWe design a simple and general ViT-based architecture (named U-ViT) for image generation with diffusion models. 
\nU-ViT is characterized by treating all inputs including the time, condition and noisy image patches as tokens \nand employing long skip connections between shallow and deep layers. \nWe evaluate U-ViT in unconditional and class-conditional image generation, \nas well as text-to-image generation tasks, where U-ViT is comparable if not superior to a CNN-based U-Net of a similar size. \nIn particular, latent diffusion models with U-ViT achieve record-breaking FID scores of 2.29 in class-conditional image generation \non ImageNet 256x256, and 5.48 in text-to-image generation on MS-COCO, among methods without accessing \nlarge external datasets during the training of generative models.\n\nOur results suggest that, for diffusion-based image modeling, the long skip connection is crucial while the down-sampling and up-sampling operators in CNN-based U-Net are not always necessary. We believe that U-ViT can provide insights for future research on backbones in diffusion models and benefit generative modeling on large scale cross-modality datasets.\n\n--------------------\n\n\n\nThis codebase implements the transformer-based backbone 📌*U-ViT*📌 for diffusion models, as introduced in the [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.12152).\nU-ViT treats all inputs as tokens and employs long skip connections. 
*The long skip connections greatly improve the performance and the convergence speed*.\n\n\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbaofff_U-ViT_readme_7cb2648074cc.png\" alt=\"drawing\" width=\"400\"\u002F>\n\n\n💡This codebase contains:\n* An implementation of [U-ViT](libs\u002Fuvit.py) with optimized attention computation\n* Pretrained U-ViT models on common image generation benchmarks (CIFAR10, CelebA 64x64, ImageNet 64x64, ImageNet 256x256, ImageNet 512x512)\n* Efficient training scripts for [pixel-space diffusion models](train.py), [latent space diffusion models](train_ldm_discrete.py) and [text-to-image diffusion models](train_t2i_discrete.py)\n* Efficient evaluation scripts for [pixel-space diffusion models](eval.py), [latent space diffusion models](eval_ldm_discrete.py) and [text-to-image diffusion models](eval_t2i_discrete.py)\n* A Colab notebook demo for sampling from U-ViT on ImageNet (FID=2.29) [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fbaofff\u002FU-ViT\u002Fblob\u002Fmain\u002FUViT_ImageNet_demo.ipynb)\n\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbaofff_U-ViT_readme_34688eece8e2.png\" alt=\"drawing\" width=\"800\"\u002F>\n\n\n💡This codebase supports useful techniques for efficient training and sampling of diffusion models:\n* Mixed precision training with the [huggingface accelerate](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate) library (🥰automatically turned on)\n* Efficient attention computation with the [facebook xformers](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fxformers) library (needs additional installation)\n* Gradient checkpointing trick, which reduces memory usage by ~65% (🥰automatically turned on)\n* With these techniques, we are able to train our largest U-ViT-H on ImageNet at high resolutions such as 256x256 and 512x512 using a large batch 
size of 1024 with *only 2 A100*❗\n\n\nTraining speed and memory of U-ViT-H\u002F2 on ImageNet 256x256 using a batch size of 128 on a single A100:\n\n| mixed precision training | xformers | gradient checkpointing |  training speed   |    memory     |\n|:------------------------:|:--------:|:----------------------:|:-----------------:|:-------------:|\n|            ❌             |    ❌     |           ❌            |         -         | out of memory |\n|            ✔             |    ❌     |           ❌            | 0.97 steps\u002Fsecond |   78852 MB    |\n|            ✔             |    ✔     |           ❌            | 1.14 steps\u002Fsecond |   54324 MB    |\n|            ✔             |    ✔     |           ✔            | 0.87 steps\u002Fsecond |   18858 MB    |\n\n\n\n## Dependency\n\n```sh\npip install torch torchvision --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu116  # install torch-1.13.1\npip install accelerate==0.12.0 absl-py ml_collections einops wandb ftfy==6.1.1 transformers==4.23.1\n\n# xformers is optional, but it would greatly speed up the attention computation.\npip install -U xformers\npip install -U --pre triton\n```\n\n* This repo is based on [`timm==0.3.2`](https:\u002F\u002Fgithub.com\u002Frwightman\u002Fpytorch-image-models), for which a [fix](https:\u002F\u002Fgithub.com\u002Frwightman\u002Fpytorch-image-models\u002Fissues\u002F420#issuecomment-776459842) is needed to work with PyTorch 1.8.1+. 
(Other versions may also work, but they have not been tested.)\n* We highly recommend installing [xformers](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fxformers), which would greatly speed up the attention computation for *both training and inference*.\n\n\n\n## Pretrained Models\n\n\n|                                                         Model                                                          |  FID  | training iterations | batch size |\n|:----------------------------------------------------------------------------------------------------------------------:|:-----:|:-------------------:|:----------:|\n|      [CIFAR10 (U-ViT-S\u002F2)](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1yoYyuzR_hQYWU0mkTj659tMTnoCWCMv-\u002Fview?usp=share_link)      | 3.11  |        500K         |    128     |\n|   [CelebA 64x64 (U-ViT-S\u002F4)](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F13YpbRtlqF1HDBNLNRlKxLTbKbKeLE06C\u002Fview?usp=share_link)    | 2.87  |        500K         |    128     |\n|  [ImageNet 64x64 (U-ViT-M\u002F4)](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1igVgRY7-A0ZV3XqdNcMGOnIGOxKr9azv\u002Fview?usp=share_link)   | 5.85  |        300K         |    1024    |\n|  [ImageNet 64x64 (U-ViT-L\u002F4)](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F19rmun-T7RwkNC1feEPWinIo-1JynpW7J\u002Fview?usp=share_link)   | 4.26  |        300K         |    1024    |\n| [ImageNet 256x256 (U-ViT-L\u002F2)](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1w7T1hiwKODgkYyMH9Nc9JNUThbxFZgs3\u002Fview?usp=share_link)  | 3.40  |        300K         |    1024    |\n| [ImageNet 256x256 (U-ViT-H\u002F2)](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F13StUdrjaaSXjfqqF7M47BzPyhMAArQ4u\u002Fview?usp=share_link)  | 2.29  |        500K         |    1024    |\n| [ImageNet 512x512 (U-ViT-L\u002F4)](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1mkj4aN2utHMBTWQX9l1nYue9vleL7ZSB\u002Fview?usp=share_link)  | 4.67 
 |        500K         |    1024    |\n| [ImageNet 512x512 (U-ViT-H\u002F4)](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1uegr2o7cuKXtf2akWGAN2Vnlrtw5YKQq\u002Fview?usp=share_link)  | 4.05  |        500K         |    1024    |\n|      [MS-COCO (U-ViT-S\u002F2)](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F15JsZWRz2byYNU6K093et5e5Xqd4uwA8S\u002Fview?usp=share_link)      | 5.95  |         1M          |    256     |\n|   [MS-COCO (U-ViT-S\u002F2, Deep)](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1gHRy8sn039Wy-iFL21wH8TiheHK8Ky71\u002Fview?usp=share_link)   | 5.48  |         1M          |    256     |\n\n\n\n## Preparation Before Training and Evaluation\n\n#### Autoencoder\nDownload `stable-diffusion` directory from this [link](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1yo-XhqbPue3rp5P57j6QbA5QZx6KybvP?usp=sharing) (which contains image autoencoders converted from [Stable Diffusion](https:\u002F\u002Fgithub.com\u002FCompVis\u002Fstable-diffusion)). \nPut the downloaded directory as `assets\u002Fstable-diffusion` in this codebase.\nThe autoencoders are used in latent diffusion models.\n\n#### Data\n* ImageNet 64x64: Put the standard ImageNet dataset (which contains the `train` and `val` directory) to `assets\u002Fdatasets\u002FImageNet`.\n* ImageNet 256x256 and ImageNet 512x512: Extract ImageNet features according to `scripts\u002Fextract_imagenet_feature.py`.\n* MS-COCO: Download COCO 2014 [training](http:\u002F\u002Fimages.cocodataset.org\u002Fzips\u002Ftrain2014.zip), [validation](http:\u002F\u002Fimages.cocodataset.org\u002Fzips\u002Fval2014.zip) data and [annotations](http:\u002F\u002Fimages.cocodataset.org\u002Fannotations\u002Fannotations_trainval2014.zip). 
Then extract their features according to `scripts\u002Fextract_mscoco_feature.py` `scripts\u002Fextract_test_prompt_feature.py` `scripts\u002Fextract_empty_feature.py`.\n\n#### Reference statistics for FID\nDownload `fid_stats` directory from this [link](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1yo-XhqbPue3rp5P57j6QbA5QZx6KybvP?usp=sharing) (which contains reference statistics for FID).\nPut the downloaded directory as `assets\u002Ffid_stats` in this codebase.\nIn addition to evaluation, these reference statistics are used to monitor FID during the training process.\n\n## Training\n\n\n\nWe use the [huggingface accelerate](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate) library to help train with distributed data parallel and mixed precision. The following is the training command:\n```sh\n# the training setting\nnum_processes=2  # the number of gpus you have, e.g., 2\ntrain_script=train.py  # the train script, one of \u003Ctrain.py|train_ldm.py|train_ldm_discrete.py|train_t2i_discrete.py>\n                       # train.py: training on pixel space\n                       # train_ldm.py: training on latent space with continuous timesteps\n                       # train_ldm_discrete.py: training on latent space with discrete timesteps\n                       # train_t2i_discrete.py: text-to-image training on latent space\nconfig=configs\u002Fcifar10_uvit_small.py  # the training configuration\n                                      # you can change other hyperparameters by modifying the configuration file\n\n# launch training\naccelerate launch --multi_gpu --num_processes $num_processes --mixed_precision fp16 $train_script --config=$config\n```\n\n\nWe provide all commands to reproduce U-ViT training in the paper:\n```sh\n# CIFAR10 (U-ViT-S\u002F2)\naccelerate launch --multi_gpu --num_processes 4 --mixed_precision fp16 train.py --config=configs\u002Fcifar10_uvit_small.py\n\n# CelebA 64x64 (U-ViT-S\u002F4)\naccelerate launch 
--multi_gpu --num_processes 4 --mixed_precision fp16 train.py --config=configs\u002Fceleba64_uvit_small.py \n\n# ImageNet 64x64 (U-ViT-M\u002F4)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 train.py --config=configs\u002Fimagenet64_uvit_mid.py\n\n# ImageNet 64x64 (U-ViT-L\u002F4)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 train.py --config=configs\u002Fimagenet64_uvit_large.py\n\n# ImageNet 256x256 (U-ViT-L\u002F2)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 train_ldm.py --config=configs\u002Fimagenet256_uvit_large.py\n\n# ImageNet 256x256 (U-ViT-H\u002F2)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 train_ldm_discrete.py --config=configs\u002Fimagenet256_uvit_huge.py\n\n# ImageNet 512x512 (U-ViT-L\u002F4)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 train_ldm.py --config=configs\u002Fimagenet512_uvit_large.py\n\n# ImageNet 512x512 (U-ViT-H\u002F4)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 train_ldm_discrete.py --config=configs\u002Fimagenet512_uvit_huge.py\n\n# MS-COCO (U-ViT-S\u002F2)\naccelerate launch --multi_gpu --num_processes 4 --mixed_precision fp16 train_t2i_discrete.py --config=configs\u002Fmscoco_uvit_small.py\n\n# MS-COCO (U-ViT-S\u002F2, Deep)\naccelerate launch --multi_gpu --num_processes 4 --mixed_precision fp16 train_t2i_discrete.py --config=configs\u002Fmscoco_uvit_small.py --config.nnet.depth=16\n```\n\n\n\n## Evaluation (Compute FID)\n\nWe use the [huggingface accelerate](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate) library for efficient inference with mixed precision and multiple gpus. 
The following is the evaluation command:\n```sh\n# the evaluation setting\nnum_processes=2  # the number of gpus you have, e.g., 2\neval_script=eval.py  # the evaluation script, one of \u003Ceval.py|eval_ldm.py|eval_ldm_discrete.py|eval_t2i_discrete.py>\n                     # eval.py: for models trained with train.py (i.e., pixel space models)\n                     # eval_ldm.py: for models trained with train_ldm.py (i.e., latent space models with continuous timesteps)\n                     # eval_ldm_discrete.py: for models trained with train_ldm_discrete.py (i.e., latent space models with discrete timesteps)\n                     # eval_t2i_discrete.py: for models trained with train_t2i_discrete.py (i.e., text-to-image models on latent space)\nconfig=configs\u002Fcifar10_uvit_small.py  # the training configuration\n\n# launch evaluation\naccelerate launch --multi_gpu --num_processes $num_processes --mixed_precision fp16 $eval_script --config=$config\n```\nThe generated images are stored in a temporary directory, and will be deleted after evaluation. 
If you want to keep these images, set `--config.sample.path=\u002Fsave\u002Fdir`.\n\n\nWe provide all commands to reproduce FID results in the paper:\n```sh\n# CIFAR10 (U-ViT-S\u002F2)\naccelerate launch --multi_gpu --num_processes 4 --mixed_precision fp16 eval.py --config=configs\u002Fcifar10_uvit_small.py --nnet_path=cifar10_uvit_small.pth\n\n# CelebA 64x64 (U-ViT-S\u002F4)\naccelerate launch --multi_gpu --num_processes 4 --mixed_precision fp16 eval.py --config=configs\u002Fceleba64_uvit_small.py --nnet_path=celeba64_uvit_small.pth\n\n# ImageNet 64x64 (U-ViT-M\u002F4)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 eval.py --config=configs\u002Fimagenet64_uvit_mid.py --nnet_path=imagenet64_uvit_mid.pth\n\n# ImageNet 64x64 (U-ViT-L\u002F4)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 eval.py --config=configs\u002Fimagenet64_uvit_large.py --nnet_path=imagenet64_uvit_large.pth\n\n# ImageNet 256x256 (U-ViT-L\u002F2)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 eval_ldm.py --config=configs\u002Fimagenet256_uvit_large.py --nnet_path=imagenet256_uvit_large.pth\n\n# ImageNet 256x256 (U-ViT-H\u002F2)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 eval_ldm_discrete.py --config=configs\u002Fimagenet256_uvit_huge.py --nnet_path=imagenet256_uvit_huge.pth\n\n# ImageNet 512x512 (U-ViT-L\u002F4)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 eval_ldm.py --config=configs\u002Fimagenet512_uvit_large.py --nnet_path=imagenet512_uvit_large.pth\n\n# ImageNet 512x512 (U-ViT-H\u002F4)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 eval_ldm_discrete.py --config=configs\u002Fimagenet512_uvit_huge.py --nnet_path=imagenet512_uvit_huge.pth\n\n# MS-COCO (U-ViT-S\u002F2)\naccelerate launch --multi_gpu --num_processes 4 --mixed_precision fp16 eval_t2i_discrete.py --config=configs\u002Fmscoco_uvit_small.py --nnet_path=mscoco_uvit_small.pth\n\n# 
MS-COCO (U-ViT-S\u002F2, Deep)\naccelerate launch --multi_gpu --num_processes 4 --mixed_precision fp16 eval_t2i_discrete.py --config=configs\u002Fmscoco_uvit_small.py --config.nnet.depth=16 --nnet_path=mscoco_uvit_small_deep.pth\n```\n\n\n\n\n## References\nIf you find the code useful for your research, please consider citing\n```bib\n@inproceedings{bao2022all,\n  title={All are Worth Words: A ViT Backbone for Diffusion Models},\n  author={Bao, Fan and Nie, Shen and Xue, Kaiwen and Cao, Yue and Li, Chongxuan and Su, Hang and Zhu, Jun},\n  booktitle = {CVPR},\n  year={2023}\n}\n```\n\nThis implementation is based on\n* [Extended Analytic-DPM](https:\u002F\u002Fgithub.com\u002Fbaofff\u002FExtended-Analytic-DPM) (provide the FID reference statistics on CIFAR10 and CelebA 64x64)\n* [guided-diffusion](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fguided-diffusion) (provide the FID reference statistics on ImageNet)\n* [pytorch-fid](https:\u002F\u002Fgithub.com\u002Fmseitzer\u002Fpytorch-fid) (provide the official implementation of FID to PyTorch)\n* [dpm-solver](https:\u002F\u002Fgithub.com\u002FLuChengTHU\u002Fdpm-solver) (provide the sampler)\n","## U-ViT\u003Cbr> \u003Csub>\u003Csmall>[All are Worth Words: A ViT Backbone for Diffusion Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.12152)（CVPR 2023）的官方 PyTorch 实现\u003C\u002Fsmall>\u003C\u002Fsub>\n\n\n💡使用 U-ViT 的项目：\n* [UniDiffuser](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002Funidiffuser)，一个基于 1B 参数 U-ViT 的多模态大规模扩散模型，已开源\n* [DPT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.10586)，[代码](https:\u002F\u002Fgithub.com\u002FML-GSAI\u002FDPT)，[演示](https:\u002F\u002Fml-gsai.github.io\u002FDPT-demo)，是一种仅使用 1 个标签\u002F类训练的条件扩散模型，在 ImageNet 上实现了 SOTA 的自监督学习生成与分类效果\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbaofff_U-ViT_readme_c73e62e3ba1e.png\" alt=\"drawing\" width=\"400\"\u002F>\n\n视觉Transformer（ViT）在各类视觉任务中展现出巨大潜力，而基于卷积神经网络（CNN）的 U-Net 仍然在扩散模型领域占据主导地位。  \n我们设计了一种简单通用的基于 ViT 的架构（命名为 
U-ViT），用于扩散模型的图像生成。  \nU-ViT 的特点在于将所有输入，包括时间、条件信息以及噪声图像块，都视为 token，并在浅层和深层之间采用长跳跃连接。  \n我们在无条件和类别条件图像生成以及文本到图像生成任务中对 U-ViT 进行了评估，结果表明，U-ViT 在性能上不逊于甚至优于同等规模的基于 CNN 的 U-Net。  \n尤其值得注意的是，使用 U-ViT 的潜在空间扩散模型在 ImageNet 256x256 上的类别条件图像生成任务中取得了创纪录的 FID 分数 2.29，在 MS-COCO 数据集上的文本到图像生成任务中则达到了 5.48，且这些方法在生成模型训练过程中并未使用大型外部数据集。\n\n我们的研究结果表明，对于基于扩散的图像建模而言，长跳跃连接至关重要，而 CNN 基础 U-Net 中的下采样和上采样操作并非总是必要的。我们相信，U-ViT 可以为未来扩散模型骨干网络的研究提供新的思路，并有助于大规模跨模态数据集上的生成建模任务。\n\n--------------------\n\n\n\n本代码库实现了扩散模型中的 Transformer 骨干 📌*U-ViT*📌，如 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.12152) 所介绍。U-ViT 将所有输入视为 token，并采用长跳跃连接。*长跳跃连接极大地提升了性能和收敛速度*。\n\n\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbaofff_U-ViT_readme_7cb2648074cc.png\" alt=\"drawing\" width=\"400\"\u002F>\n\n\n💡本代码库包含：\n* 经优化注意力计算的 [U-ViT](libs\u002Fuvit.py) 实现\n* 在常见图像生成基准数据集上预训练的 U-ViT 模型（CIFAR10、CelebA 64x64、ImageNet 64x64、ImageNet 256x256、ImageNet 512x512）\n* 针对 [像素空间扩散模型](train.py)、[潜在空间扩散模型](train_ldm_discrete.py) 和 [文本到图像扩散模型](train_t2i_discrete.py) 的高效训练脚本\n* 针对 [像素空间扩散模型](eval.py)、[潜在空间扩散模型](eval_ldm_discrete.py) 和 [文本到图像扩散模型](eval_t2i_discrete.py) 的高效评估脚本\n* 一个 Colab 笔记本演示，用于从 U-ViT 生成 ImageNet 样本（FID=2.29）[![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fbaofff\u002FU-ViT\u002Fblob\u002Fmain\u002FUViT_ImageNet_demo.ipynb)\n\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbaofff_U-ViT_readme_34688eece8e2.png\" alt=\"drawing\" width=\"800\"\u002F>\n\n\n💡本代码库支持多种用于高效训练和采样的扩散模型技术：\n* 使用 [huggingface accelerate](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate) 库进行混合精度训练（🥰自动开启）\n* 使用 [facebook xformers](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fxformers) 库实现高效的注意力计算（需额外安装）\n* 梯度检查点技术，可减少约 65% 的显存占用（🥰自动开启）\n* 结合这些技术，我们能够在仅配备 2 张 A100 显卡的情况下，以 1024 的大批次训练最大规模的 U-ViT-H 模型，分辨率高达 256x256 和 512x512❗\n\n\n使用一张 A100 显卡，在 ImageNet 256x256 上以 
128 的批次训练 U-ViT-H\u002F2 时的训练速度和显存消耗：\n\n| 混合精度训练 | xformers | 梯度检查点 | 训练速度   | 显存     |\n|:------------------------:|:--------:|:----------------------:|:-----------------:|:-------------:|\n|            ❌             |    ❌     |           ❌            |         -         | 显存不足 |\n|            ✔             |    ❌     |           ❌            | 0.97 步\u002F秒 |   78852 MB    |\n|            ✔             |    ✔     |           ❌            | 1.14 步\u002F秒 |   54324 MB    |\n|            ✔             |    ✔     |           ✔            | 0.87 步\u002F秒 |   18858 MB    |\n\n\n\n## 依赖项\n\n```sh\npip install torch torchvision --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu116  # 安装 torch-1.13.1\npip install accelerate==0.12.0 absl-py ml_collections einops wandb ftfy==6.1.1 transformers==4.23.1\n\n# xformers 是可选的，但能显著加速注意力计算。\npip install -U xformers\npip install -U --pre triton\n```\n\n* 本仓库基于 [`timm==0.3.2`](https:\u002F\u002Fgithub.com\u002Frwightman\u002Fpytorch-image-models)，该版本需要应用一个 [修复](https:\u002F\u002Fgithub.com\u002Frwightman\u002Fpytorch-image-models\u002Fissues\u002F420#issuecomment-776459842)，才能与 PyTorch 1.8.1+ 兼容。（其他版本或许也能工作，但我尚未测试过。）\n* 我们强烈建议安装 [xformers](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fxformers)，它能极大提升 *训练和推理* 过程中的注意力计算速度。\n\n## 预训练模型\n\n\n|                                                         模型                                                          |  FID  | 训练迭代次数 | 批量大小 |\n|:----------------------------------------------------------------------------------------------------------------------:|:-----:|:-------------------:|:----------:|\n|      [CIFAR10 (U-ViT-S\u002F2)](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1yoYyuzR_hQYWU0mkTj659tMTnoCWCMv-\u002Fview?usp=share_link)      | 3.11  |        50万         |    128     |\n|   [CelebA 64x64 (U-ViT-S\u002F4)](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F13YpbRtlqF1HDBNLNRlKxLTbKbKeLE06C\u002Fview?usp=share_link)    | 
2.87  |        50万         |    128     |\n|  [ImageNet 64x64 (U-ViT-M\u002F4)](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1igVgRY7-A0ZV3XqdNcMGOnIGOxKr9azv\u002Fview?usp=share_link)   | 5.85  |        30万         |    1024    |\n|  [ImageNet 64x64 (U-ViT-L\u002F4)](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F19rmun-T7RwkNC1feEPWinIo-1JynpW7J\u002Fview?usp=share_link)   | 4.26  |        30万         |    1024    |\n| [ImageNet 256x256 (U-ViT-L\u002F2)](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1w7T1hiwKODgkYyMH9Nc9JNUThbxFZgs3\u002Fview?usp=share_link)  | 3.40  |        30万         |    1024    |\n| [ImageNet 256x256 (U-ViT-H\u002F2)](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F13StUdrjaaSXjfqqF7M47BzPyhMAArQ4u\u002Fview?usp=share_link)  | 2.29  |        50万         |    1024    |\n| [ImageNet 512x512 (U-ViT-L\u002F4)](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1mkj4aN2utHMBTWQX9l1nYue9vleL7ZSB\u002Fview?usp=share_link)  | 4.67  |        50万         |    1024    |\n| [ImageNet 512x512 (U-ViT-H\u002F4)](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1uegr2o7cuKXtf2akWGAN2Vnlrtw5YKQq\u002Fview?usp=share_link)  | 4.05  |        50万         |    1024    |\n|      [MS-COCO (U-ViT-S\u002F2)](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F15JsZWRz2byYNU6K093et5e5Xqd4uwA8S\u002Fview?usp=share_link)      | 5.95  |         100万          |    256     |\n|   [MS-COCO (U-ViT-S\u002F2, Deep)](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1gHRy8sn039Wy-iFL21wH8TiheHK8Ky71\u002Fview?usp=share_link)   | 5.48  |         100万          |    256     |\n\n\n\n## 训练与评估前的准备\n\n#### 自编码器\n从该[链接](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1yo-XhqbPue3rp5P57j6QbA5QZx6KybvP?usp=sharing)下载 `stable-diffusion` 目录（其中包含由 [Stable Diffusion](https:\u002F\u002Fgithub.com\u002FCompVis\u002Fstable-diffusion) 转换而来的图像自编码器）。\n将下载的目录放置到本代码库中的 `assets\u002Fstable-diffusion` 
文件夹下。\n这些自编码器用于潜在扩散模型中。\n\n#### 数据\n* ImageNet 64x64：将标准 ImageNet 数据集（包含 `train` 和 `val` 目录）放入 `assets\u002Fdatasets\u002FImageNet`。\n* ImageNet 256x256 和 ImageNet 512x512：根据 `scripts\u002Fextract_imagenet_feature.py` 提取 ImageNet 特征。\n* MS-COCO：下载 COCO 2014 [训练](http:\u002F\u002Fimages.cocodataset.org\u002Fzips\u002Ftrain2014.zip)、[验证](http:\u002F\u002Fimages.cocodataset.org\u002Fzips\u002Fval2014.zip)数据以及 [标注](http:\u002F\u002Fimages.cocodataset.org\u002Fannotations\u002Fannotations_trainval2014.zip)。然后分别使用 `scripts\u002Fextract_mscoco_feature.py`、`scripts\u002Fextract_test_prompt_feature.py` 和 `scripts\u002Fextract_empty_feature.py` 提取相关特征。\n\n#### FID 参考统计\n从该[链接](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1yo-XhqbPue3rp5P57j6QbA5QZx6KybvP?usp=sharing)下载 `fid_stats` 目录（其中包含 FID 的参考统计信息）。\n将下载的目录放置到本代码库中的 `assets\u002Ffid_stats` 文件夹下。\n除了用于评估外，这些参考统计信息还用于在训练过程中监控 FID。\n\n## 训练\n\n\n\n我们使用 [huggingface accelerate](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate) 库来支持分布式数据并行和混合精度训练。以下是训练命令：\n```sh\n# 训练设置\nnum_processes=2  # 你拥有的 GPU 数量，例如 2\ntrain_script=train.py  # 训练脚本，可选 \u003Ctrain.py|train_ldm.py|train_ldm_discrete.py|train_t2i_discrete.py>\n                       # train.py：在像素空间进行训练\n                       # train_ldm.py：在潜在空间使用连续时间步长进行训练\n                       # train_ldm_discrete.py：在潜在空间使用离散时间步长进行训练\n                       # train_t2i_discrete.py：在潜在空间进行文本到图像的训练\nconfig=configs\u002Fcifar10_uvit_small.py  # 训练配置文件\n                                      # 你可以通过修改配置文件来调整其他超参数\n\n# 启动训练\naccelerate launch --multi_gpu --num_processes $num_processes --mixed_precision fp16 $train_script --config=$config\n```\n\n\n我们提供了论文中复现 U-ViT 训练的所有命令：\n```sh\n# CIFAR10 (U-ViT-S\u002F2)\naccelerate launch --multi_gpu --num_processes 4 --mixed_precision fp16 train.py --config=configs\u002Fcifar10_uvit_small.py\n\n# CelebA 64x64 (U-ViT-S\u002F4)\naccelerate launch --multi_gpu --num_processes 4 --mixed_precision fp16 train.py 
--config=configs\u002Fceleba64_uvit_small.py \n\n# ImageNet 64x64 (U-ViT-M\u002F4)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 train.py --config=configs\u002Fimagenet64_uvit_mid.py\n\n# ImageNet 64x64 (U-ViT-L\u002F4)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 train.py --config=configs\u002Fimagenet64_uvit_large.py\n\n# ImageNet 256x256 (U-ViT-L\u002F2)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 train_ldm.py --config=configs\u002Fimagenet256_uvit_large.py\n\n# ImageNet 256x256 (U-ViT-H\u002F2)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 train_ldm_discrete.py --config=configs\u002Fimagenet256_uvit_huge.py\n\n# ImageNet 512x512 (U-ViT-L\u002F4)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 train_ldm.py --config=configs\u002Fimagenet512_uvit_large.py\n\n# ImageNet 512x512 (U-ViT-H\u002F4)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 train_ldm_discrete.py --config=configs\u002Fimagenet512_uvit_huge.py\n\n# MS-COCO (U-ViT-S\u002F2)\naccelerate launch --multi_gpu --num_processes 4 --mixed_precision fp16 train_t2i_discrete.py --config=configs\u002Fmscoco_uvit_small.py\n\n# MS-COCO (U-ViT-S\u002F2, Deep)\naccelerate launch --multi_gpu --num_processes 4 --mixed_precision fp16 train_t2i_discrete.py --config=configs\u002Fmscoco_uvit_small.py --config.nnet.depth=16\n```\n\n\n\n## 评估（计算 FID）\n\n我们使用 [huggingface accelerate](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate) 库来进行高效的混合精度和多 GPU 推理。以下是评估命令：\n```sh\n\n# 评估设置\nnum_processes=2  # 您拥有的 GPU 数量，例如 2\neval_script=eval.py  # 评估脚本，可选 \u003Ceval.py|eval_ldm.py|eval_ldm_discrete.py|eval_t2i_discrete.py>\n                     # eval.py: 用于使用 train.py 训练的模型（即像素空间模型）\n                     # eval_ldm.py: 用于使用 train_ldm.py 训练的模型（即具有连续时间步的潜在空间模型）\n                     # eval_ldm_discrete.py: 用于使用 train_ldm_discrete.py 训练的模型（即具有离散时间步的潜在空间模型）\n               
      # eval_t2i_discrete.py: 用于使用 train_t2i_discrete.py 训练的模型（即潜在空间上的文本到图像模型）\nconfig=configs\u002Fcifar10_uvit_small.py  # 训练配置\n\n# 启动评估\naccelerate launch --multi_gpu --num_processes $num_processes --mixed_precision fp16 $eval_script --config=$config\n```\n生成的图片会存储在一个临时目录中，评估结束后会被删除。如果您想保留这些图片，请设置 `--config.sample.path=\u002Fsave\u002Fdir`。\n\n\n我们提供了复现论文中 FID 结果的所有命令：\n```sh\n# CIFAR10 (U-ViT-S\u002F2)\naccelerate launch --multi_gpu --num_processes 4 --mixed_precision fp16 eval.py --config=configs\u002Fcifar10_uvit_small.py --nnet_path=cifar10_uvit_small.pth\n\n# CelebA 64x64 (U-ViT-S\u002F4)\naccelerate launch --multi_gpu --num_processes 4 --mixed_precision fp16 eval.py --config=configs\u002Fceleba64_uvit_small.py --nnet_path=celeba64_uvit_small.pth\n\n# ImageNet 64x64 (U-ViT-M\u002F4)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 eval.py --config=configs\u002Fimagenet64_uvit_mid.py --nnet_path=imagenet64_uvit_mid.pth\n\n# ImageNet 64x64 (U-ViT-L\u002F4)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 eval.py --config=configs\u002Fimagenet64_uvit_large.py --nnet_path=imagenet64_uvit_large.pth\n\n# ImageNet 256x256 (U-ViT-L\u002F2)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 eval_ldm.py --config=configs\u002Fimagenet256_uvit_large.py --nnet_path=imagenet256_uvit_large.pth\n\n# ImageNet 256x256 (U-ViT-H\u002F2)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 eval_ldm_discrete.py --config=configs\u002Fimagenet256_uvit_huge.py --nnet_path=imagenet256_uvit_huge.pth\n\n# ImageNet 512x512 (U-ViT-L\u002F4)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 eval_ldm.py --config=configs\u002Fimagenet512_uvit_large.py --nnet_path=imagenet512_uvit_large.pth\n\n# ImageNet 512x512 (U-ViT-H\u002F4)\naccelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 eval_ldm_discrete.py --config=configs\u002Fimagenet512_uvit_huge.py 
--nnet_path=imagenet512_uvit_huge.pth\n\n# MS-COCO (U-ViT-S\u002F2)\naccelerate launch --multi_gpu --num_processes 4 --mixed_precision fp16 eval_t2i_discrete.py --config=configs\u002Fmscoco_uvit_small.py --nnet_path=mscoco_uvit_small.pth\n\n# MS-COCO (U-ViT-S\u002F2, Deep)\naccelerate launch --multi_gpu --num_processes 4 --mixed_precision fp16 eval_t2i_discrete.py --config=configs\u002Fmscoco_uvit_small.py --config.nnet.depth=16 --nnet_path=mscoco_uvit_small_deep.pth\n```\n\n\n\n\n## 参考文献\n如果您觉得该代码对您的研究有帮助，请考虑引用以下文献：\n```bib\n@inproceedings{bao2022all,\n  title={All are Worth Words: A ViT Backbone for Diffusion Models},\n  author={Bao, Fan and Nie, Shen and Xue, Kaiwen and Cao, Yue and Li, Chongxuan and Su, Hang and Zhu, Jun},\n  booktitle = {CVPR},\n  year={2023}\n}\n```\n\n本实现基于以下项目：\n* [Extended Analytic-DPM](https:\u002F\u002Fgithub.com\u002Fbaofff\u002FExtended-Analytic-DPM)（提供 CIFAR10 和 CelebA 64x64 的 FID 参考统计数据）\n* [guided-diffusion](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fguided-diffusion)（提供 ImageNet 的 FID 参考统计数据）\n* [pytorch-fid](https:\u002F\u002Fgithub.com\u002Fmseitzer\u002Fpytorch-fid)（提供 FID 的 PyTorch 官方实现）\n* [dpm-solver](https:\u002F\u002Fgithub.com\u002FLuChengTHU\u002Fdpm-solver)（提供采样器）","# U-ViT 快速上手指南\n\nU-ViT 是一个基于 Vision Transformer (ViT) 的扩散模型骨干网络架构。它将时间、条件和噪声图像块均视为 Token，并采用长跳跃连接（Long Skip Connections），在图像生成任务中表现优异，尤其在 ImageNet 和 MS-COCO 数据集上取得了领先的 FID 分数。\n\n## 1. 环境准备\n\n*   **操作系统**: Linux (推荐)\n*   **Python**: 3.8+\n*   **GPU**: 支持 CUDA 的 NVIDIA GPU (训练高分辨率模型如 ImageNet 256x256+ 建议使用 A100 或多卡环境)\n*   **PyTorch**: 1.13.1+ (需匹配 CUDA 版本，官方示例基于 cu116)\n\n## 2. 
安装步骤\n\n### 2.1 安装基础依赖\n首先安装 PyTorch 及核心依赖库。国内用户可使用清华或阿里镜像加速下载。\n\n```bash\n# 安装 PyTorch 1.13.1 (以 CUDA 11.6 为例，其他版本请参照 pytorch 官网)\npip install torch==1.13.1 torchvision --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu116\n\n# 安装其他核心依赖\npip install accelerate==0.12.0 absl-py ml_collections einops wandb ftfy==6.1.1 transformers==4.23.1 timm==0.3.2\n```\n\n> **注意**: `timm==0.3.2` 需要应用一个补丁才能兼容 PyTorch 1.8.1+，详见 timm 仓库（rwightman\u002Fpytorch-image-models）的 issue #420。\n\n### 2.2 安装加速组件（强烈推荐）\n安装 `xformers` 可显著加速注意力计算并降低显存占用，是训练大模型的关键。\n\n```bash\n# 安装 xformers 和 triton\npip install -U xformers\npip install -U --pre triton\n```\n\n### 2.3 准备预训练资源与数据\n在运行训练或评估前，需下载以下资源并放置于项目根目录的 `assets\u002F` 文件夹下：\n\n1.  **自编码器 (Autoencoder)**: 用于潜在空间扩散模型 (LDM)。\n    *   下载 `stable-diffusion` 目录，放置到 `assets\u002Fstable-diffusion`。\n2.  **FID 统计参考值**: 用于评估和训练监控。\n    *   下载 `fid_stats` 目录，放置到 `assets\u002Ffid_stats`。\n3.  **数据集**:\n    *   **ImageNet 64x64**: 将包含 `train` 和 `val` 目录的标准 ImageNet 数据集放入 `assets\u002Fdatasets\u002FImageNet`。\n    *   **ImageNet 256\u002F512 & MS-COCO**: 需先下载原始数据，然后运行 `scripts\u002F` 下的特征提取脚本预处理。\n\n## 3. 
基本使用\n\n本项目使用 Hugging Face `accelerate` 库进行分布式训练和混合精度推理。\n\n### 3.1 启动训练\n以下命令展示了如何启动一个典型的训练任务（以 CIFAR10 为例）。你可以根据需求更换配置文件 (`--config`) 和训练脚本。\n\n```bash\n# 设置 GPU 数量 (例如 4 卡)\nnum_processes=4\n\n# 启动训练 (CIFAR10, U-ViT-S\u002F2)\n# train.py 用于像素空间训练\n# train_ldm_discrete.py 用于潜在空间离散时间步训练\n# train_t2i_discrete.py 用于文生图训练\naccelerate launch --multi_gpu --num_processes $num_processes --mixed_precision fp16 train.py --config=configs\u002Fcifar10_uvit_small.py\n```\n\n**常用训练配置示例：**\n*   **ImageNet 256x256 (U-ViT-H\u002F2)**:\n    ```bash\n    accelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 train_ldm_discrete.py --config=configs\u002Fimagenet256_uvit_huge.py\n    ```\n*   **MS-COCO 文生图**:\n    ```bash\n    accelerate launch --multi_gpu --num_processes 4 --mixed_precision fp16 train_t2i_discrete.py --config=configs\u002Fmscoco_uvit_small.py\n    ```\n\n### 3.2 模型评估 (计算 FID)\n使用提供的评估脚本计算生成图像的 FID 分数，并通过 `--nnet_path` 指定待评估的模型权重文件。\n\n```bash\n# 设置 GPU 数量\nnum_processes=2\n\n# 评估像素空间模型\naccelerate launch --multi_gpu --num_processes $num_processes --mixed_precision fp16 eval.py --config=configs\u002Fcifar10_uvit_small.py --nnet_path=\u003C路径到检查点文件>\n\n# 评估潜在空间模型 (LDM)\naccelerate launch --multi_gpu --num_processes $num_processes --mixed_precision fp16 eval_ldm_discrete.py --config=configs\u002Fimagenet256_uvit_huge.py --nnet_path=\u003C路径到检查点文件>\n```\n\n### 3.3 快速体验 (Colab)\n如果你没有本地 GPU 环境，可以直接使用官方提供的 Colab Notebook 在云端采样 ImageNet 图像 (FID=2.29 模型)：\n\n[![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fbaofff\u002FU-ViT\u002Fblob\u002Fmain\u002FUViT_ImageNet_demo.ipynb)","某 AI 实验室团队正致力于构建一个高质量的电商商品图生成系统，需要在有限的算力资源下训练出高分辨率、高保真的扩散模型。\n\n### 没有 U-ViT 时\n- **架构瓶颈明显**：依赖传统 CNN 基础的 U-Net 架构，难以有效捕捉图像全局语义信息，导致生成的商品细节（如纹理、Logo）经常模糊或变形。\n- **硬件门槛过高**：若要训练 256x256 或更高分辨率的模型，通常需要耗费数张高端 GPU（如 8 张 A100），中小团队难以承担昂贵的算力成本。\n- **收敛速度慢**：模型训练周期漫长，且浅层与深层特征交互不足，导致生成图像的 FID 分数（衡量图像质量指标）迟迟无法突破 
3.0，达不到商用标准。\n- **显存占用巨大**：未采用高效的梯度检查点技术，大批次训练时显存极易溢出，迫使开发人员不断缩小 Batch Size，进一步拖慢训练效率。\n\n### 使用 U-ViT 后\n- **全局感知增强**：U-ViT 将时间、条件及噪声图像块统一视为 Token，并利用长跳跃连接（Long Skip Connections），显著提升了商品细节的清晰度和结构一致性。\n- **算力成本骤降**：凭借混合精度训练和梯度检查点优化，仅需 2 张 A100 显卡即可支撑 1024 的大批次高分辨率训练，大幅降低了硬件投入。\n- **生成质量破纪录**：在 ImageNet 256x256 任务上实现了 2.29 的超低 FID 分数，生成的商品图逼真度媲美甚至超越大型卷积模型，直接满足上线需求。\n- **训练效率飞跃**：优化的注意力计算机制加速了模型收敛，让团队能在更短时间内完成多轮迭代，快速验证新的创意提示词（Prompt）。\n\nU-ViT 通过革新性的 Transformer 架构与极致的工程优化，让中小团队也能以低成本打造出业界顶尖的图像生成能力。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbaofff_U-ViT_24778cc8.png","baofff","Fan Bao","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fbaofff_c74da1d8.jpg","Ph.D. student in Tsinghua University.",null,"https:\u002F\u002Fbaofff.github.io\u002F","https:\u002F\u002Fgithub.com\u002Fbaofff",[82,86],{"name":83,"color":84,"percentage":85},"Jupyter Notebook","#DA5B0B",95.1,{"name":87,"color":88,"percentage":89},"Python","#3572A5",4.9,1106,78,"2026-04-07T18:20:10","MIT","Linux","必需 NVIDIA GPU。官方测试基于 A100，支持混合精度训练和梯度检查点以优化显存。训练大型模型（如 ImageNet 256x256 U-ViT-H）在开启优化后仅需 2 张 A100；未开启优化时显存需求极高（>78GB）。需安装 CUDA 11.6 (cu116)。","未说明",{"notes":98,"python":99,"dependencies":100},"1. 必须安装特定版本的 PyTorch (1.13.1 + cu116)。2. 强烈建议安装 xformers 库以大幅加速注意力计算并降低显存占用。3. 代码基于 timm==0.3.2，在 PyTorch 1.8.1+ 环境下可能需要应用特定修复补丁。4. 训练前需手动下载 Stable Diffusion 的自动编码器权重、ImageNet\u002FCOCO 数据集特征文件以及 FID 评估所需的参考统计数据，并放置于指定目录。5. 
使用 HuggingFace Accelerate 进行分布式训练和混合精度控制。","未说明 (依赖 PyTorch 1.13.1)",[101,102,103,104,105,106,107,108,109,110],"torch==1.13.1","torchvision","accelerate==0.12.0","xformers (可选但强烈推荐)","triton","transformers==4.23.1","timm==0.3.2","absl-py","ml_collections","einops",[15,112],"其他","2026-03-27T02:49:30.150509","2026-04-11T03:24:36.652478",[116,121,126,131,136,141,146,151],{"id":117,"question_zh":118,"answer_zh":119,"source_url":120},28773,"如何复现论文中的 FID 分数？为什么我得到的结果比论文差？","确保使用正确的配置参数，特别是无条件引导比例 `p_uncond`。例如在 ImageNet256_uvit_L\u002F2 中，代码配置应为 `p_uncond=0.15`（而非论文中的 0.1）。此外，训练迭代次数需足够（如 ImageNet 约 500k 次迭代），并使用 EMA 模型进行评估。若遵循这些设置，可复现出接近论文的 FID 分数（如 3.52）。","https:\u002F\u002Fgithub.com\u002Fbaofff\u002FU-ViT\u002Fissues\u002F14",{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},28774,"评估模型时应该使用 `nnet.pth` 还是 `nnet_ema.pth`？","对于扩散模型（如 DDPM 和 U-ViT），使用指数移动平均（EMA）模型通常能获得更好的结果。论文中报告的 FID 分数也是基于 EMA 模型评估得出的。使用 `nnet_ema.pth` 进行评估，FID 分数可能从非 EMA 模型的 17 提升至 7 左右。","https:\u002F\u002Fgithub.com\u002Fbaofff\u002FU-ViT\u002Fissues\u002F20",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},28775,"如何将配置从 'noise_pred' 改为 'x0_pred' 时修复维度不匹配的错误？","该错误是因为 PyTorch 将时间步 `t` 与输入 `xt` 的第一轴相乘时，维度不匹配（例如 500 vs 32）。维护者已修复此 Bug。如果遇到类似问题，请检查 `t` 和 `xt` 的形状，确保广播机制正确，或更新到最新代码版本以获取修复。","https:\u002F\u002Fgithub.com\u002Fbaofff\u002FU-ViT\u002Fissues\u002F3",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},28776,"运行 `extract_test_prompt_feature.py` 脚本报错 'ValueError: setting an array element with a sequence' 如何解决？","这是因为 numpy 无法直接保存包含不规则形状序列的数组。解决方法是修改保存和读取代码：\n1. 保存时改用 `np.savez`：`np.savez(os.path.join(save_dir, str(i)), prompt=prompts[i], context=c)`\n2. 
读取时（在 `datasets.py` 中）改为：\n```python\n_data = np.load(os.path.join(path, 'run_vis', f), allow_pickle=True)\nprompt, context = _data['prompt'], _data['context']\n```","https:\u002F\u002Fgithub.com\u002Fbaofff\u002FU-ViT\u002Fissues\u002F19",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},28777,"训练初期 Loss 没有下降甚至波动，这是正常现象吗？","是的，这在扩散模型训练早期是正常的。特别是在大规模数据集（如 ImageNet 256）上训练大型模型（如 uvit_huge）的前 30k 次迭代内，Loss 可能快速下降后进入平台期。在训练的中后期，Loss 下降非常缓慢，肉眼难以观察。建议通过可视化生成样本的质量或使用 FID 等定量指标来监控训练进度，而不是仅依赖 Loss 曲线。","https:\u002F\u002Fgithub.com\u002Fbaofff\u002FU-ViT\u002Fissues\u002F13",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},28778,"如何计算 U-ViT 模型的 GFLOPs？","可以使用 `facebookresearch\u002Ffvcore` 库来计算。示例代码如下：\n```python\nfrom fvcore.nn import FlopCountAnalysis\nfrom libs.uvit import UViT\nimport torch\n\nmodel = UViT(\n  img_size=32,\n  patch_size=2,\n  in_chans=4,\n  embed_dim=1152,\n  depth=28,\n  num_heads=16,\n  mlp_ratio=4,\n  qkv_bias=False,\n  mlp_time_embed=False,\n  num_classes=1001,\n  use_checkpoint=True,\n  conv=False\n)\n\ninputs = (torch.rand(1, 4, 32, 32), torch.tensor([1]), torch.tensor([1]))\nflops = FlopCountAnalysis(model, inputs)\nprint(\"FLOPS: \", f'{flops.total() \u002F 1e9}G')\n```","https:\u002F\u002Fgithub.com\u002Fbaofff\u002FU-ViT\u002Fissues\u002F2",{"id":147,"question_zh":148,"answer_zh":149,"source_url":150},28779,"如何运行文本生成图像（Text-to-Image）的演示？","项目提供了脚本 `sample_t2i_discrete.py` 用于演示。使用方法如下：\n```bash\npython sample_t2i_discrete.py --config=configs\u002Fmscoco_uvit_small.py --config.nnet.depth=16 --nnet_path=mscoco_uvit_small_deep.pth --input_path=prompts.txt --output_path=the_directory_to_put_generated_samples\n```\n其中 `prompts.txt` 包含输入的文本提示，输出将保存在指定目录。","https:\u002F\u002Fgithub.com\u002Fbaofff\u002FU-ViT\u002Fissues\u002F1",{"id":152,"question_zh":153,"answer_zh":154,"source_url":155},28780,"运行训练脚本时报错 'AttributeError: GradientState object has no attribute _iterate_samples_seen' 怎么办？","这通常是由于 `accelerate` 或 `torch` 
版本不兼容导致的。即使 README 中标注了版本，Conda 也可能自动安装更新的 PyTorch 版本（如 2.3.0），从而引发此错误。解决方法是显式安装兼容的旧版本 PyTorch，例如：\n```bash\npip install torch==1.13.1 torchvision --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu116\n```","https:\u002F\u002Fgithub.com\u002Fbaofff\u002FU-ViT\u002Fissues\u002F22",[]]