[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-CompVis--latent-diffusion":3,"tool-CompVis--latent-diffusion":64},[4,17,26,35,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,2,"2026-04-03T11:11:01",[13,14,15],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":10,"last_commit_at":32,"category_tags":33,"status":16},4292,"Deep-Live-Cam","hacksider\u002FDeep-Live-Cam","Deep-Live-Cam 
是一款专注于实时换脸与视频生成的开源工具，用户仅需一张静态照片，即可通过“一键操作”实现摄像头画面的即时变脸或制作深度伪造视频。它有效解决了传统换脸技术流程繁琐、对硬件配置要求极高以及难以实时预览的痛点，让高质量的数字内容创作变得触手可及。\n\n这款工具不仅适合开发者和技术研究人员探索算法边界，更因其极简的操作逻辑（仅需三步：选脸、选摄像头、启动），广泛适用于普通用户、内容创作者、设计师及直播主播。无论是为了动画角色定制、服装展示模特替换，还是制作趣味短视频和直播互动，Deep-Live-Cam 都能提供流畅的支持。\n\n其核心技术亮点在于强大的实时处理能力，支持口型遮罩（Mouth Mask）以保留使用者原始的嘴部动作，确保表情自然精准；同时具备“人脸映射”功能，可同时对画面中的多个主体应用不同面孔。此外，项目内置了严格的内容安全过滤机制，自动拦截涉及裸露、暴力等不当素材，并倡导用户在获得授权及明确标注的前提下合规使用，体现了技术发展与伦理责任的平衡。",88924,"2026-04-06T03:28:53",[13,14,15,34],"视频",{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,43,34,44,15,45,46,13,47],"数据工具","插件","其他","语言模型","音频",{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":10,"last_commit_at":54,"category_tags":55,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 
协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,46,45],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74939,"2026-04-05T23:16:38",[46,14,13,45],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":81,"owner_twitter":80,"owner_website":82,"owner_url":83,"languages":84,"stars":97,"forks":98,"last_commit_at":99,"license":100,"difficulty_score":10,"env_os":101,"env_gpu":102,"env_ram":103,"env_deps":104,"category_tags":118,"github_topics":80,"view_count":23,"oss_zip_url":80,"oss_zip_packed_at":80,"status":16,"created_at":119,"updated_at":120,"faqs":121,"releases":150},4196,"CompVis\u002Flatent-diffusion","latent-diffusion","High-Resolution Image Synthesis with Latent Diffusion Models","latent-diffusion 是一个专注于高分辨率图像生成的开源深度学习框架，其核心基于潜在扩散模型（Latent Diffusion Models, LDM）。它主要解决了传统扩散模型在生成高清图像时计算成本极高、推理速度慢的难题。通过在压缩的潜在空间而非原始像素空间进行扩散过程，latent-diffusion 在大幅降低显存需求和提升运算效率的同时，依然能保持卓越的图像生成质量。\n\n该项目不仅提供了强大的文生图能力，还支持类条件生成及检索增强生成等多种模式。其独特的技术亮点在于高效的潜在空间操作机制，以及后来集成的无分类器引导（classifier-free guidance）和 PLMS 采样器，这些改进进一步提升了生成速度与效果可控性。此外，项目开源了多个预训练模型，包括在大规模 LAION 数据集上训练的 14.5 亿参数模型，方便用户直接调用或微调。\n\nlatent-diffusion 非常适合 AI 
研究人员探索生成模型架构，开发者构建自定义图像应用，以及设计师寻找高效的创意辅助工具。对于希望深入理解扩散模型原理并动手实践的技术爱好者来说，这也是一个极具价值的学习资","latent-diffusion 是一个专注于高分辨率图像生成的开源深度学习框架，其核心基于潜在扩散模型（Latent Diffusion Models, LDM）。它主要解决了传统扩散模型在生成高清图像时计算成本极高、推理速度慢的难题。通过在压缩的潜在空间而非原始像素空间进行扩散过程，latent-diffusion 在大幅降低显存需求和提升运算效率的同时，依然能保持卓越的图像生成质量。\n\n该项目不仅提供了强大的文生图能力，还支持类条件生成及检索增强生成等多种模式。其独特的技术亮点在于高效的潜在空间操作机制，以及后来集成的无分类器引导（classifier-free guidance）和 PLMS 采样器，这些改进进一步提升了生成速度与效果可控性。此外，项目开源了多个预训练模型，包括在大规模 LAION 数据集上训练的 14.5 亿参数模型，方便用户直接调用或微调。\n\nlatent-diffusion 非常适合 AI 研究人员探索生成模型架构，开发者构建自定义图像应用，以及设计师寻找高效的创意辅助工具。对于希望深入理解扩散模型原理并动手实践的技术爱好者来说，这也是一个极具价值的学习资源。通过简洁的环境配置和丰富的示例代码，用户可以快速上手体验前沿的图像合成技术。","# Latent Diffusion Models\n[arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.10752) | [BibTeX](#bibtex)\n\n\u003Cp align=\"center\">\n\u003Cimg src=assets\u002Fresults.gif \u002F>\n\u003C\u002Fp>\n\n\n\n[**High-Resolution Image Synthesis with Latent Diffusion Models**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.10752)\u003Cbr\u002F>\n[Robin Rombach](https:\u002F\u002Fgithub.com\u002Frromb)\\*,\n[Andreas Blattmann](https:\u002F\u002Fgithub.com\u002Fablattmann)\\*,\n[Dominik Lorenz](https:\u002F\u002Fgithub.com\u002Fqp-qp)\\,\n[Patrick Esser](https:\u002F\u002Fgithub.com\u002Fpesser),\n[Björn Ommer](https:\u002F\u002Fhci.iwr.uni-heidelberg.de\u002FStaff\u002Fbommer)\u003Cbr\u002F>\n\\* equal contribution\n\n\u003Cp align=\"center\">\n\u003Cimg src=assets\u002Fmodelfigure.png \u002F>\n\u003C\u002Fp>\n\n## News\n\n### July 2022\n- Inference code and model weights to run our [retrieval-augmented diffusion models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.11824) are now available. See [this section](#retrieval-augmented-diffusion-models).\n### April 2022\n- Thanks to [Katherine Crowson](https:\u002F\u002Fgithub.com\u002Fcrowsonkb), classifier-free guidance received a ~2x speedup and the [PLMS sampler](https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.09778) is available. 
See also [this PR](https:\u002F\u002Fgithub.com\u002FCompVis\u002Flatent-diffusion\u002Fpull\u002F51).\n\n- Our 1.45B [latent diffusion LAION model](#text-to-image) was integrated into [Huggingface Spaces 🤗](https:\u002F\u002Fhuggingface.co\u002Fspaces) using [Gradio](https:\u002F\u002Fgithub.com\u002Fgradio-app\u002Fgradio). Try out the Web Demo: [![Hugging Face Spaces](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmultimodalart\u002Flatentdiffusion)\n\n- More pre-trained LDMs are available: \n  - A 1.45B [model](#text-to-image) trained on the [LAION-400M](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.02114) database.\n  - A class-conditional model on ImageNet, achieving an FID of 3.6 when using [classifier-free guidance](https:\u002F\u002Fopenreview.net\u002Fpdf?id=qw8AKxfYbI). Available via a [colab notebook](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FCompVis\u002Flatent-diffusion\u002Fblob\u002Fmain\u002Fscripts\u002Flatent_imagenet_diffusion.ipynb) [![][colab]][colab-cin].\n  \n## Requirements\nA suitable [conda](https:\u002F\u002Fconda.io\u002F) environment named `ldm` can be created\nand activated with:\n\n```\nconda env create -f environment.yaml\nconda activate ldm\n```\n\n# Pretrained Models\nA general list of all available checkpoints is available via our [model zoo](#model-zoo).\nIf you use any of these models in your work, we are always happy to receive a [citation](#bibtex).\n\n## Retrieval Augmented Diffusion Models\n![rdm-figure](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCompVis_latent-diffusion_readme_cc5bccf3b904.jpg)\nWe include inference code to run our retrieval-augmented diffusion models (RDMs) as described in [https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.11824](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.11824).\n\n\nTo get started, install the additionally required Python packages into your `ldm` 
environment\n```shell script\npip install transformers==4.19.2 scann kornia==0.6.4 torchmetrics==0.6.0\npip install git+https:\u002F\u002Fgithub.com\u002Farogozhnikov\u002Feinops.git\n```\nand download the trained weights (preliminary checkpoints):\n\n```bash\nmkdir -p models\u002Frdm\u002Frdm768x768\u002F\nwget -O models\u002Frdm\u002Frdm768x768\u002Fmodel.ckpt https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Frdm\u002Fmodel.ckpt\n```\nAs these models are conditioned on a set of CLIP image embeddings, our RDMs support different inference modes, \nwhich are described in the following.\n#### RDM with text-prompt only (no explicit retrieval needed)\nSince CLIP offers a shared image\u002Ftext feature space, and RDMs learn to cover a neighborhood of a given\nexample during training, we can directly take a CLIP text embedding of a given prompt and condition on it.\nRun this mode via\n```\npython scripts\u002Fknn2img.py  --prompt \"a happy bear reading a newspaper, oil on canvas\"\n```\n\n#### RDM with text-to-image retrieval\n\nTo be able to run an RDM conditioned on a text-prompt and additionally on images retrieved from this prompt, you will also need to download the corresponding retrieval database. \nWe provide two distinct databases extracted from the [Openimages-](https:\u002F\u002Fstorage.googleapis.com\u002Fopenimages\u002Fweb\u002Findex.html) and [ArtBench-](https:\u002F\u002Fgithub.com\u002Fliaopeiyuan\u002Fartbench) datasets. \nInterchanging the databases results in different capabilities of the model as visualized below, although the learned weights are the same in both cases. 
\n\nDownload the retrieval-databases which contain the retrieval-datasets ([Openimages](https:\u002F\u002Fstorage.googleapis.com\u002Fopenimages\u002Fweb\u002Findex.html) (~11GB) and [ArtBench](https:\u002F\u002Fgithub.com\u002Fliaopeiyuan\u002Fartbench) (~82MB)) compressed into CLIP image embeddings:\n```bash\nmkdir -p data\u002Frdm\u002Fretrieval_databases\nwget -O data\u002Frdm\u002Fretrieval_databases\u002Fartbench.zip https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Frdm\u002Fartbench_databases.zip\nwget -O data\u002Frdm\u002Fretrieval_databases\u002Fopenimages.zip https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Frdm\u002Fopenimages_database.zip\nunzip data\u002Frdm\u002Fretrieval_databases\u002Fartbench.zip -d data\u002Frdm\u002Fretrieval_databases\u002F\nunzip data\u002Frdm\u002Fretrieval_databases\u002Fopenimages.zip -d data\u002Frdm\u002Fretrieval_databases\u002F\n```\nWe also provide trained [ScaNN](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fgoogle-research\u002Ftree\u002Fmaster\u002Fscann) search indices for ArtBench. Download and extract via\n```bash\nmkdir -p data\u002Frdm\u002Fsearchers\nwget -O data\u002Frdm\u002Fsearchers\u002Fartbench.zip https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Frdm\u002Fartbench_searchers.zip\nunzip data\u002Frdm\u002Fsearchers\u002Fartbench.zip -d data\u002Frdm\u002Fsearchers\n```\n\nSince the index for OpenImages is large (~21 GB), we provide a script to create and save it for usage during sampling. Note however,\nthat sampling with the OpenImages database will not be possible without this index. Run the script via\n```bash\npython scripts\u002Ftrain_searcher.py\n```\n\nRetrieval based text-guided sampling with visual nearest neighbors can be started via \n```\npython scripts\u002Fknn2img.py  --prompt \"a happy pineapple\" --use_neighbors --knn \u003Cnumber_of_neighbors> \n```\nNote that the maximum supported number of neighbors is 20. 
\nThe database can be changed via the command-line parameter ``--database``, which can be one of `[openimages, artbench-art_nouveau, artbench-baroque, artbench-expressionism, artbench-impressionism, artbench-post_impressionism, artbench-realism, artbench-renaissance, artbench-romanticism, artbench-surrealism, artbench-ukiyo_e]`.\nTo use `--database openimages`, the above script (`scripts\u002Ftrain_searcher.py`) must be run first.\nDue to their relatively small size, the ArtBench databases are best suited for creating more abstract concepts and do not work well for detailed text control. \n\n\n#### Coming Soon\n- better models\n- more resolutions\n- image-to-image retrieval\n\n## Text-to-Image\n![text2img-figure](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCompVis_latent-diffusion_readme_8c52d997ba27.png) \n\n\nDownload the pre-trained weights (5.7GB)\n```\nmkdir -p models\u002Fldm\u002Ftext2img-large\u002F\nwget -O models\u002Fldm\u002Ftext2img-large\u002Fmodel.ckpt https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fnitro\u002Ftxt2img-f8-large\u002Fmodel.ckpt\n```\nand sample with\n```\npython scripts\u002Ftxt2img.py --prompt \"a virus monster is playing guitar, oil on canvas\" --ddim_eta 0.0 --n_samples 4 --n_iter 4 --scale 5.0  --ddim_steps 50\n```\nThis will save each sample individually as well as a grid of size `n_iter` x `n_samples` at the specified output location (default: `outputs\u002Ftxt2img-samples`).\nQuality, sampling speed and diversity are best controlled via the `scale`, `ddim_steps` and `ddim_eta` arguments.\nAs a rule of thumb, higher values of `scale` produce better samples at the cost of a reduced output diversity.   \nFurthermore, increasing `ddim_steps` generally also gives higher quality samples, but returns are diminishing for values > 250.\nFast sampling (i.e. low values of `ddim_steps`) while retaining good quality can be achieved by using `--ddim_eta 0.0`.  \nFaster sampling (i.e. 
even lower values of `ddim_steps`) while retaining good quality can be achieved by using `--ddim_eta 0.0` and `--plms` (see [Pseudo Numerical Methods for Diffusion Models on Manifolds](https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.09778)).\n\n#### Beyond 256²\n\nFor certain inputs, simply running the model in a convolutional fashion on larger features than it was trained on\ncan sometimes result in interesting results. To try it out, tune the `H` and `W` arguments (which will be integer-divided\nby 8 in order to calculate the corresponding latent size), e.g. run\n\n```\npython scripts\u002Ftxt2img.py --prompt \"a sunset behind a mountain range, vector image\" --ddim_eta 1.0 --n_samples 1 --n_iter 1 --H 384 --W 1024 --scale 5.0  \n```\nto create a sample of size 384x1024. Note, however, that controllability is reduced compared to the 256x256 setting. \n\nThe example below was generated using the above command. \n![text2img-figure-conv](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCompVis_latent-diffusion_readme_898c1939c979.png)\n\n\n\n## Inpainting\n![inpainting](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCompVis_latent-diffusion_readme_36fd1cf9d19d.png)\n\nDownload the pre-trained weights\n```\nwget -O models\u002Fldm\u002Finpainting_big\u002Flast.ckpt https:\u002F\u002Fheibox.uni-heidelberg.de\u002Ff\u002F4d9ac7ea40c64582b7c9\u002F?dl=1\n```\n\nand sample with\n```\npython scripts\u002Finpaint.py --indir data\u002Finpainting_examples\u002F --outdir outputs\u002Finpainting_results\n```\n`indir` should contain images `*.png` and masks `\u003Cimage_fname>_mask.png` like\nthe examples provided in `data\u002Finpainting_examples`.\n\n## Class-Conditional ImageNet\n\nAvailable via a [notebook](scripts\u002Flatent_imagenet_diffusion.ipynb) [![][colab]][colab-cin].\n![class-conditional](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCompVis_latent-diffusion_readme_003fc4375cbf.png)\n\n[colab]: 
\u003Chttps:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg>\n[colab-cin]: \u003Chttps:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FCompVis\u002Flatent-diffusion\u002Fblob\u002Fmain\u002Fscripts\u002Flatent_imagenet_diffusion.ipynb>\n\n\n## Unconditional Models\n\nWe also provide a script for sampling from unconditional LDMs (e.g. LSUN, FFHQ, ...). Start it via\n\n```shell script\nCUDA_VISIBLE_DEVICES=\u003CGPU_ID> python scripts\u002Fsample_diffusion.py -r models\u002Fldm\u002F\u003Cmodel_spec>\u002Fmodel.ckpt -l \u003Clogdir> -n \u003C\\#samples> --batch_size \u003Cbatch_size> -c \u003C\\#ddim steps> -e \u003C\\#eta> \n```\n\n# Train your own LDMs\n\n## Data preparation\n\n### Faces \nFor downloading the CelebA-HQ and FFHQ datasets, proceed as described in the [taming-transformers](https:\u002F\u002Fgithub.com\u002FCompVis\u002Ftaming-transformers#celeba-hq) \nrepository.\n\n### LSUN \n\nThe LSUN datasets can be conveniently downloaded via the script available [here](https:\u002F\u002Fgithub.com\u002Ffyu\u002Flsun).\nWe performed a custom split into training and validation images, and provide the corresponding filenames\nat [https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flsun.zip](https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flsun.zip). \nAfter downloading, extract them to `.\u002Fdata\u002Flsun`. The beds\u002Fcats\u002Fchurches subsets should\nalso be placed\u002Fsymlinked at `.\u002Fdata\u002Flsun\u002Fbedrooms`\u002F`.\u002Fdata\u002Flsun\u002Fcats`\u002F`.\u002Fdata\u002Flsun\u002Fchurches`, respectively.\n\n### ImageNet\nThe code will try to download (through [Academic\nTorrents](http:\u002F\u002Facademictorrents.com\u002F)) and prepare ImageNet the first time it\nis used. However, since ImageNet is quite large, this requires a lot of disk\nspace and time. 
If you already have ImageNet on your disk, you can speed things\nup by putting the data into\n`${XDG_CACHE}\u002Fautoencoders\u002Fdata\u002FILSVRC2012_{split}\u002Fdata\u002F` (which defaults to\n`~\u002F.cache\u002Fautoencoders\u002Fdata\u002FILSVRC2012_{split}\u002Fdata\u002F`), where `{split}` is one\nof `train`\u002F`validation`. It should have the following structure:\n\n```\n${XDG_CACHE}\u002Fautoencoders\u002Fdata\u002FILSVRC2012_{split}\u002Fdata\u002F\n├── n01440764\n│   ├── n01440764_10026.JPEG\n│   ├── n01440764_10027.JPEG\n│   ├── ...\n├── n01443537\n│   ├── n01443537_10007.JPEG\n│   ├── n01443537_10014.JPEG\n│   ├── ...\n├── ...\n```\n\nIf you haven't extracted the data, you can also place\n`ILSVRC2012_img_train.tar`\u002F`ILSVRC2012_img_val.tar` (or symlinks to them) into\n`${XDG_CACHE}\u002Fautoencoders\u002Fdata\u002FILSVRC2012_train\u002F` \u002F\n`${XDG_CACHE}\u002Fautoencoders\u002Fdata\u002FILSVRC2012_validation\u002F`, which will then be\nextracted into above structure without downloading it again.  Note that this\nwill only happen if neither a folder\n`${XDG_CACHE}\u002Fautoencoders\u002Fdata\u002FILSVRC2012_{split}\u002Fdata\u002F` nor a file\n`${XDG_CACHE}\u002Fautoencoders\u002Fdata\u002FILSVRC2012_{split}\u002F.ready` exist. 
Remove them\nif you want to force running the dataset preparation again.\n\n\n## Model Training\n\nLogs and checkpoints for trained models are saved to `logs\u002F\u003CSTART_DATE_AND_TIME>_\u003Cconfig_spec>`.\n\n### Training autoencoder models\n\nConfigs for training a KL-regularized autoencoder on ImageNet are provided at `configs\u002Fautoencoder`.\nTraining can be started by running\n```\nCUDA_VISIBLE_DEVICES=\u003CGPU_ID> python main.py --base configs\u002Fautoencoder\u002F\u003Cconfig_spec>.yaml -t --gpus 0,    \n```\nwhere `config_spec` is one of {`autoencoder_kl_8x8x64`(f=32, d=64), `autoencoder_kl_16x16x16`(f=16, d=16), \n`autoencoder_kl_32x32x4`(f=8, d=4), `autoencoder_kl_64x64x3`(f=4, d=3)}.\n\nFor training VQ-regularized models, see the [taming-transformers](https:\u002F\u002Fgithub.com\u002FCompVis\u002Ftaming-transformers) \nrepository.\n\n### Training LDMs \n\nIn ``configs\u002Flatent-diffusion\u002F`` we provide configs for training LDMs on the LSUN-, CelebA-HQ, FFHQ and ImageNet datasets. \nTraining can be started by running\n\n```shell script\nCUDA_VISIBLE_DEVICES=\u003CGPU_ID> python main.py --base configs\u002Flatent-diffusion\u002F\u003Cconfig_spec>.yaml -t --gpus 0,\n``` \n\nwhere ``\u003Cconfig_spec>`` is one of {`celebahq-ldm-vq-4`(f=4, VQ-reg. autoencoder, spatial size 64x64x3),`ffhq-ldm-vq-4`(f=4, VQ-reg. autoencoder, spatial size 64x64x3),\n`lsun_bedrooms-ldm-vq-4`(f=4, VQ-reg. autoencoder, spatial size 64x64x3),\n`lsun_churches-ldm-vq-4`(f=8, KL-reg. autoencoder, spatial size 32x32x4),`cin-ldm-vq-8`(f=8, VQ-reg. 
autoencoder, spatial size 32x32x4)}.\n\n# Model Zoo \n\n## Pretrained Autoencoding Models\n![rec2](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCompVis_latent-diffusion_readme_5525ca4033eb.png)\n\nAll models were trained until convergence (no further substantial improvement in rFID).\n\n| Model                   | rFID vs val | train steps           |PSNR           | PSIM          | Link                                                                                                                                                  | Comments              \n|-------------------------|------------|----------------|----------------|---------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------|\n| f=4, VQ (Z=8192, d=3)   | 0.58       | 533066 | 27.43  +\u002F- 4.26 | 0.53 +\u002F- 0.21 |     https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fvq-f4.zip                   |  |\n| f=4, VQ (Z=8192, d=3)   | 1.06       | 658131 | 25.21 +\u002F-  4.17 | 0.72 +\u002F- 0.26 | https:\u002F\u002Fheibox.uni-heidelberg.de\u002Ff\u002F9c6681f64bb94338a069\u002F?dl=1  | no attention          |\n| f=8, VQ (Z=16384, d=4)  | 1.14       | 971043 | 23.07 +\u002F- 3.99 | 1.17 +\u002F- 0.36 |       https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fvq-f8.zip                     |                       |\n| f=8, VQ (Z=256, d=4)    | 1.49       | 1608649 | 22.35 +\u002F- 3.81 | 1.26 +\u002F- 0.37 |   https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fvq-f8-n256.zip |  \n| f=16, VQ (Z=16384, d=8) | 5.15       | 1101166 | 20.83 +\u002F- 3.61 | 1.73 +\u002F- 0.43 |             https:\u002F\u002Fheibox.uni-heidelberg.de\u002Ff\u002F0e42b04e2e904890a9b6\u002F?dl=1                        |                       |\n|                         |            |  |                |               |                   
                                                                                                                                  |                       |\n| f=4, KL                 | 0.27       | 176991 | 27.53 +\u002F- 4.54 | 0.55 +\u002F- 0.24 |     https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fkl-f4.zip                                   |                       |\n| f=8, KL                 | 0.90       | 246803 | 24.19 +\u002F- 4.19 | 1.02 +\u002F- 0.35 |             https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fkl-f8.zip                            |                       |\n| f=16, KL     (d=16)     | 0.87       | 442998 | 24.08 +\u002F- 4.22 | 1.07 +\u002F- 0.36 |      https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fkl-f16.zip                                  |                       |\n | f=32, KL     (d=64)     | 2.04       | 406763 | 22.27 +\u002F- 3.93 | 1.41 +\u002F- 0.40 |             https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fkl-f32.zip                            |                       |\n\n### Get the models\n\nRunning the following script downloads and extracts all available pretrained autoencoding models.   
 \n```shell script\nbash scripts\u002Fdownload_first_stages.sh\n```\n\nThe first stage models can then be found in `models\u002Ffirst_stage_models\u002F\u003Cmodel_spec>`.\n\n\n\n## Pretrained LDMs\n| Dataset                         |   Task    | Model        | FID           | IS              | Prec | Recall | Link                                                                                                                                                                                   | Comments                                        \n|---------------------------------|------|--------------|---------------|-----------------|------|------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|\n| CelebA-HQ                       | Unconditional Image Synthesis    |  LDM-VQ-4 (200 DDIM steps, eta=0)| 5.11 (5.11)          | 3.29            | 0.72    | 0.49 |    https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fceleba.zip     |                                                 |  \n| FFHQ                            | Unconditional Image Synthesis    |  LDM-VQ-4 (200 DDIM steps, eta=1)| 4.98 (4.98)  | 4.50 (4.50)   | 0.73 | 0.50 |              https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fffhq.zip                                              |                                                 |\n| LSUN-Churches                   | Unconditional Image Synthesis   |  LDM-KL-8 (400 DDIM steps, eta=0)| 4.02 (4.02) | 2.72 | 0.64 | 0.52 |         https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Flsun_churches.zip        |                                                 |  \n| LSUN-Bedrooms                   | Unconditional Image Synthesis   |  LDM-VQ-4 (200 DDIM steps, eta=1)| 2.95 (3.0)          | 2.22 (2.23)| 0.66 | 0.48 | 
https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Flsun_bedrooms.zip |                                                 |  \n| ImageNet                        | Class-conditional Image Synthesis | LDM-VQ-8 (200 DDIM steps, eta=1) | 7.77(7.76)* \u002F15.82** | 201.56(209.52)* \u002F78.82** | 0.84* \u002F 0.65** | 0.35* \u002F 0.63** |   https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fcin.zip                                                                   | *: w\u002F guiding, classifier_scale 10  **: w\u002Fo guiding, scores in bracket calculated with script provided by [ADM](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fguided-diffusion) |   \n| Conceptual Captions             |  Text-conditional Image Synthesis | LDM-VQ-f4 (100 DDIM steps, eta=0) | 16.79         | 13.89           | N\u002FA | N\u002FA |              https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Ftext2img.zip                                | finetuned from LAION                            |   \n| OpenImages                      | Super-resolution   | LDM-VQ-4     | N\u002FA            | N\u002FA               | N\u002FA    | N\u002FA    |                                    https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fsr_bsr.zip                                    | BSR image degradation                           |\n| OpenImages                      | Layout-to-Image Synthesis    | LDM-VQ-4 (200 DDIM steps, eta=0) | 32.02         | 15.92           | N\u002FA    | N\u002FA    |                  https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Flayout2img_model.zip                                           |                                                 | \n| Landscapes      |  Semantic Image Synthesis   | LDM-VQ-4  | N\u002FA             | N\u002FA               | N\u002FA    | N\u002FA    |           
https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fsemantic_synthesis256.zip                                    |                                                 |\n| Landscapes       |  Semantic Image Synthesis   | LDM-VQ-4  | N\u002FA             | N\u002FA               | N\u002FA    | N\u002FA    |           https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fsemantic_synthesis.zip                                    |             finetuned on resolution 512x512                                     |\n\n\n### Get the models\n\nThe LDMs listed above can jointly be downloaded and extracted via\n\n```shell script\nbash scripts\u002Fdownload_models.sh\n```\n\nThe models can then be found in `models\u002Fldm\u002F\u003Cmodel_spec>`.\n\n\n\n## Coming Soon...\n\n* More inference scripts for conditional LDMs.\n* In the meantime, you can play with our colab notebook https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1xqzUi2iXQXDqXBHQGP9Mqt2YrYW6cx-J?usp=sharing\n\n## Comments \n\n- Our codebase for the diffusion models builds heavily on [OpenAI's ADM codebase](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fguided-diffusion)\nand [https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fdenoising-diffusion-pytorch](https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fdenoising-diffusion-pytorch). \nThanks for open-sourcing!\n\n- The implementation of the transformer encoder is from [x-transformers](https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fx-transformers) by [lucidrains](https:\u002F\u002Fgithub.com\u002Flucidrains?tab=repositories). 
\n\n\n## BibTeX\n\n```\n@misc{rombach2021highresolution,\n      title={High-Resolution Image Synthesis with Latent Diffusion Models}, \n      author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},\n      year={2021},\n      eprint={2112.10752},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n\n@misc{https:\u002F\u002Fdoi.org\u002F10.48550\u002Farxiv.2204.11824,\n  doi = {10.48550\u002FARXIV.2204.11824},\n  url = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.11824},\n  author = {Blattmann, Andreas and Rombach, Robin and Oktay, Kaan and Ommer, Björn},\n  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},\n  title = {Retrieval-Augmented Diffusion Models},\n  publisher = {arXiv},\n  year = {2022},  \n  copyright = {arXiv.org perpetual, non-exclusive license}\n}\n\n\n```\n\n\n","# 潜扩散模型\n[arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.10752) | [BibTeX](#bibtex)\n\n\u003Cp align=\"center\">\n\u003Cimg src=assets\u002Fresults.gif \u002F>\n\u003C\u002Fp>\n\n\n\n[**高分辨率图像合成中的潜扩散模型**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.10752)\u003Cbr\u002F>\n[Robin Rombach](https:\u002F\u002Fgithub.com\u002Frromb)\\*,\n[Andreas Blattmann](https:\u002F\u002Fgithub.com\u002Fablattmann)\\*,\n[Dominik Lorenz](https:\u002F\u002Fgithub.com\u002Fqp-qp)\\,\n[Patrick Esser](https:\u002F\u002Fgithub.com\u002Fpesser),\n[Björn Ommer](https:\u002F\u002Fhci.iwr.uni-heidelberg.de\u002FStaff\u002Fbommer)\u003Cbr\u002F>\n\\* 等贡献\n\n\u003Cp align=\"center\">\n\u003Cimg src=assets\u002Fmodelfigure.png \u002F>\n\u003C\u002Fp>\n\n## 新闻\n\n### 2022年7月\n- 用于运行我们[检索增强扩散模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.11824)的推理代码和模型权重现已可用。请参阅[本节](#retrieval-augmented-diffusion-models)。\n### 2022年4月\n- 感谢[Katherine 
Crowson](https:\u002F\u002Fgithub.com\u002Fcrowsonkb)，无分类器指导获得了约2倍的速度提升，且[PLMS采样器](https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.09778)现已可用。另请参阅[此PR](https:\u002F\u002Fgithub.com\u002FCompVis\u002Flatent-diffusion\u002Fpull\u002F51)。\n\n- 我们的14.5亿参数[潜扩散LAION模型](#text-to-image)已通过[Gradio](https:\u002F\u002Fgithub.com\u002Fgradio-app\u002Fgradio)集成到[Huggingface Spaces 🤗](https:\u002F\u002Fhuggingface.co\u002Fspaces)中。试用Web演示：[![Hugging Face Spaces](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmultimodalart\u002Flatentdiffusion)\n\n- 更多预训练的LDMs现已可用：\n  - 一个在[LAION-400M](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.02114)数据库上训练的14.5亿参数[模型](#text-to-image)。\n  - 一个基于ImageNet的类别条件模型，在使用[无分类器指导](https:\u002F\u002Fopenreview.net\u002Fpdf?id=qw8AKxfYbI)时达到了3.6的FID值。可通过[Colab笔记本](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FCompVis\u002Flatent-diffusion\u002Fblob\u002Fmain\u002Fscripts\u002Flatent_imagenet_diffusion.ipynb)获取[![][colab]][colab-cin]。\n\n## 要求\n可以创建并激活一个名为`ldm`的合适[conda](https:\u002F\u002Fconda.io\u002F)环境，方法如下：\n\n```\nconda env create -f environment.yaml\nconda activate ldm\n```\n\n# 预训练模型\n所有可用检查点的通用列表可通过我们的[模型库](#model-zoo)获得。如果您在工作中使用了这些模型之一，我们非常乐意收到您的[引用](#bibtex)。\n\n## 检索增强扩散模型\n![rdm-figure](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCompVis_latent-diffusion_readme_cc5bccf3b904.jpg)\n我们包含了用于运行我们在[https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.11824](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.11824)中描述的检索增强扩散模型（RDMs）的推理代码。\n\n要开始使用，请将额外需要的Python包安装到您的`ldm`环境中：\n```shell script\npip install transformers==4.19.2 scann kornia==0.6.4 torchmetrics==0.6.0\npip install git+https:\u002F\u002Fgithub.com\u002Farogozhnikov\u002Feinops.git\n```\n并下载训练好的权重（初步检查点）：\n\n```bash\nmkdir -p models\u002Frdm\u002Frdm768x768\u002F\nwget -O models\u002Frdm\u002Frdm768x768\u002Fmodel.ckpt 
https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Frdm\u002Fmodel.ckpt\n```\n由于这些模型是基于一组CLIP图像嵌入进行条件化的，因此我们的RDMs支持不同的推理模式，具体说明如下。\n#### 仅使用文本提示的RDM（无需显式检索）\n由于CLIP提供共享的图像\u002F文本特征空间，且RDMs在训练过程中学会覆盖给定示例的邻域，我们可以直接采用给定提示的CLIP文本嵌入作为条件。通过以下命令运行此模式：\n```\npython scripts\u002Fknn2img.py  --prompt \"一只快乐的熊正在读报纸，油画\"\n```\n\n#### 基于文本到图像检索的RDM\n\n要运行一个既基于文本提示又结合从该提示中检索到的图像的RDM，您还需要下载相应的检索数据库。我们提供了两个不同的数据库，分别提取自[Openimages-](https:\u002F\u002Fstorage.googleapis.com\u002Fopenimages\u002Fweb\u002Findex.html)和[ArtBench-](https:\u002F\u002Fgithub.com\u002Fliaopeiyuan\u002Fartbench)数据集。切换数据库会导致模型表现出不同的能力，如下所示，尽管两种情况下的学习权重是相同的。\n\n下载包含检索数据集（[Openimages](https:\u002F\u002Fstorage.googleapis.com\u002Fopenimages\u002Fweb\u002Findex.html) (~11GB)和[ArtBench](https:\u002F\u002Fgithub.com\u002Fliaopeiyuan\u002Fartbench) (~82MB)）并压缩为CLIP图像嵌入的检索数据库：\n```bash\nmkdir -p data\u002Frdm\u002Fretrieval_databases\nwget -O data\u002Frdm\u002Fretrieval_databases\u002Fartbench.zip https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Frdm\u002Fartbench_databases.zip\nwget -O data\u002Frdm\u002Fretrieval_databases\u002Fopenimages.zip https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Frdm\u002Fopenimages_database.zip\nunzip data\u002Frdm\u002Fretrieval_databases\u002Fartbench.zip -d data\u002Frdm\u002Fretrieval_databases\u002F\nunzip data\u002Frdm\u002Fretrieval_databases\u002Fopenimages.zip -d data\u002Frdm\u002Fretrieval_databases\u002F\n```\n我们还提供了针对ArtBench的训练好的[ScaNN](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fgoogle-research\u002Ftree\u002Fmaster\u002Fscann)搜索索引。通过以下命令下载并解压：\n```bash\nmkdir -p data\u002Frdm\u002Fsearchers\nwget -O data\u002Frdm\u002Fsearchers\u002Fartbench.zip https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Frdm\u002Fartbench_searchers.zip\nunzip data\u002Frdm\u002Fsearchers\u002Fartbench.zip -d data\u002Frdm\u002Fsearchers\n```\n\n由于OpenImages的索引较大（~21 GB），我们提供了一个脚本用于创建并保存它，以便在采样时使用。请注意，\n如果没有这个索引，将无法使用OpenImages数据库进行采样。通过以下命令运行该脚本：\n```bash\npython 
scripts\u002Ftrain_searcher.py\n```\n\n基于检索的文本引导采样，结合视觉最近邻，可以通过以下命令启动：\n```bash\npython scripts\u002Fknn2img.py  --prompt \"一颗快乐的菠萝\" --use_neighbors --knn \u003C邻居数量> \n```\n请注意，支持的最大邻居数为20。数据库可以通过cmd参数``--database``更改，可选值为 `[openimages, artbench-art_nouveau, artbench-baroque, artbench-expressionism, artbench-impressionism, artbench-post_impressionism, artbench-realism, artbench-renaissance, artbench-romanticism, artbench-surrealism, artbench-ukiyo_e]`。若要使用`--database openimages`，必须先执行上述脚本（`scripts\u002Ftrain_searcher.py`）。由于ArtBench数据集规模相对较小，它们最适合用于生成更抽象的概念，而不适用于精细的文本控制。\n\n\n#### 即将推出\n- 更好的模型\n- 更多分辨率\n- 图像到图像检索\n\n## 文本生成图像\n![text2img-figure](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCompVis_latent-diffusion_readme_8c52d997ba27.png) \n\n\n下载预训练权重（5.7GB）\n```\nmkdir -p models\u002Fldm\u002Ftext2img-large\u002F\nwget -O models\u002Fldm\u002Ftext2img-large\u002Fmodel.ckpt https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fnitro\u002Ftxt2img-f8-large\u002Fmodel.ckpt\n```\n并使用以下命令进行采样：\n```\npython scripts\u002Ftxt2img.py --prompt \"一个病毒怪物正在弹吉他，油画布\" --ddim_eta 0.0 --n_samples 4 --n_iter 4 --scale 5.0  --ddim_steps 50\n```\n这将会在指定的输出目录（默认为`outputs\u002Ftxt2img-samples`）下，分别保存每个样本以及一个大小为`n_iter` x `n_samples`的网格图。\n质量、采样速度和多样性主要通过`scale`、`ddim_steps`和`ddim_eta`参数来控制。\n一般来说，较高的`scale`值会生成更好的样本，但会降低输出的多样性。  \n此外，增加`ddim_steps`通常也会提高样本的质量，但对于超过250步的情况，收益会逐渐递减。\n若想在保持良好质量的同时加快采样速度（即减少`ddim_steps`），可以使用`--ddim_eta 0.0`。  \n如果希望进一步加快采样速度（即更低的`ddim_steps`），则可以在使用`--ddim_eta 0.0`的基础上再添加`--plms`选项（参见[流形上的扩散模型伪数值方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.09778))。\n\n#### 超过256²\n\n对于某些输入，直接以卷积方式在比模型训练时更大的特征图上运行该模型，\n有时也能得到有趣的结果。要尝试这一点，可以调整`H`和`W`参数（它们会被整除8以计算对应的潜在尺寸），例如运行：\n```\npython scripts\u002Ftxt2img.py --prompt \"山峦后的日落，矢量图\" --ddim_eta 1.0 --n_samples 1 --n_iter 1 --H 384 --W 1024 --scale 5.0  
\n```\n以生成384x1024大小的样本。需要注意的是，与256x256的设置相比，可控性会有所降低。\n\n下面的例子就是使用上述命令生成的。\n![text2img-figure-conv](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCompVis_latent-diffusion_readme_898c1939c979.png)\n\n\n\n## 图像修复\n![inpainting](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCompVis_latent-diffusion_readme_36fd1cf9d19d.png)\n\n下载预训练权重\n```\nwget -O models\u002Fldm\u002Finpainting_big\u002Flast.ckpt https:\u002F\u002Fheibox.uni-heidelberg.de\u002Ff\u002F4d9ac7ea40c64582b7c9\u002F?dl=1\n```\n\n并使用以下命令进行采样：\n```\npython scripts\u002Finpaint.py --indir data\u002Finpainting_examples\u002F --outdir outputs\u002Finpainting_results\n```\n`indir`目录应包含`*.png`格式的图片及其对应的掩码文件`\u003Cimage_fname>_mask.png`，如`data\u002Finpainting_examples`中提供的示例所示。\n\n## 类条件ImageNet\n\n可通过[笔记本](scripts\u002Flatent_imagenet_diffusion.ipynb)访问 [![][colab]][colab-cin]。\n![class-conditional](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCompVis_latent-diffusion_readme_003fc4375cbf.png)\n\n[colab]: \u003Chttps:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg>\n[colab-cin]: \u003Chttps:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FCompVis\u002Flatent-diffusion\u002Fblob\u002Fmain\u002Fscripts\u002Flatent_imagenet_diffusion.ipynb>\n\n\n## 无条件模型\n\n我们还提供了一个用于从无条件LDMs（如LSUN、FFHQ等）中采样的脚本。可以通过以下命令启动：\n\n```shell script\nCUDA_VISIBLE_DEVICES=\u003CGPU_ID> python scripts\u002Fsample_diffusion.py -r models\u002Fldm\u002F\u003Cmodel_spec>\u002Fmodel.ckpt -l \u003Clogdir> -n \u003C\\#samples> --batch_size \u003Cbatch_size> -c \u003C\\#ddim steps> -e \u003C\\#eta> \n```\n\n# 训练您自己的LDMs\n\n## 数据准备\n\n### 人脸\n要下载CelebA-HQ和FFHQ数据集，请按照[taming-transformers](https:\u002F\u002Fgithub.com\u002FCompVis\u002Ftaming-transformers#celeba-hq)仓库中的说明操作。\n\n### LSUN \n\nLSUN数据集可以通过此处提供的脚本方便地下载：[https:\u002F\u002Fgithub.com\u002Ffyu\u002Flsun](https:\u002F\u002Fgithub.com\u002Ffyu\u002Flsun)。  
\n我们对数据进行了自定义划分，分为训练集和验证集，并将相应的文件名列表放在[https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flsun.zip](https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flsun.zip)。  \n下载后，请将其解压到`.\u002Fdata\u002Flsun`目录下。其中的bedrooms\u002Fcats\u002Fchurches子集也应分别放置或创建符号链接至`.\u002Fdata\u002Flsun\u002Fbedrooms`\u002F`.\u002Fdata\u002Flsun\u002Fcats`\u002F`.\u002Fdata\u002Flsun\u002Fchurches`目录。\n\n### ImageNet\n代码首次使用时会尝试通过[Academic Torrents](http:\u002F\u002Facademictorrents.com\u002F)下载并准备ImageNet数据。然而，由于ImageNet数据量较大，这需要大量的磁盘空间和时间。  \n如果您已经拥有ImageNet数据，则可以通过将其放入`${XDG_CACHE}\u002Fautoencoders\u002Fdata\u002FILSVRC2012_{split}\u002Fdata\u002F`（默认路径为`~\u002F.cache\u002Fautoencoders\u002Fdata\u002FILSVRC2012_{split}\u002Fdata\u002F`）来加速流程，其中`{split}`可取`train`或`validation`。  \n其目录结构应如下所示：\n\n```\n${XDG_CACHE}\u002Fautoencoders\u002Fdata\u002FILSVRC2012_{split}\u002Fdata\u002F\n├── n01440764\n│   ├── n01440764_10026.JPEG\n│   ├── n01440764_10027.JPEG\n│   ├── ...\n├── n01443537\n│   ├── n01443537_10007.JPEG\n│   ├── n01443537_10014.JPEG\n│   ├── ...\n├── ...\n```\n\n如果您尚未解压数据，也可以将`ILSVRC2012_img_train.tar`\u002F`ILSVRC2012_img_val.tar`（或它们的符号链接）放入`${XDG_CACHE}\u002Fautoencoders\u002Fdata\u002FILSVRC2012_train\u002F` \u002F `${XDG_CACHE}\u002Fautoencoders\u002Fdata\u002FILSVRC2012_validation\u002F`，系统会自动解压成上述结构，而无需再次下载。  \n请注意，只有当不存在`${XDG_CACHE}\u002Fautoencoders\u002Fdata\u002FILSVRC2012_{split}\u002Fdata\u002F`文件夹或`${XDG_CACHE}\u002Fautoencoders\u002Fdata\u002FILSVRC2012_{split}\u002F.ready`文件时，才会执行此操作。若要强制重新运行数据准备流程，需删除这些文件。\n\n\n## 模型训练\n\n已训练模型的日志和检查点将保存到`logs\u002F\u003CSTART_DATE_AND_TIME>_\u003Cconfig_spec>`目录下。\n\n### 自编码器模型的训练\n\n我们在`configs\u002Fautoencoder`中提供了用于在ImageNet上训练KL正则化自编码器的配置文件。  \n训练可以通过以下命令开始：\n```\nCUDA_VISIBLE_DEVICES=\u003CGPU_ID> python main.py --base configs\u002Fautoencoder\u002F\u003Cconfig_spec>.yaml -t --gpus 0,    \n```\n其中`config_spec`可取{`autoencoder_kl_8x8x64`(f=32, d=64), `autoencoder_kl_16x16x16`(f=16, d=16), \n`autoencoder_kl_32x32x4`(f=8, d=4), 
`autoencoder_kl_64x64x3`(f=4, d=3)}。\n\n有关VQ正则化的模型训练，请参阅[taming-transformers](https:\u002F\u002Fgithub.com\u002FCompVis\u002Ftaming-transformers)仓库。\n\n### LDMs的训练\n\n在`configs\u002Flatent-diffusion\u002F`中，我们提供了针对LSUN、CelebA-HQ、FFHQ和ImageNet数据集训练LDMs的配置文件。  \n训练可以通过以下命令开始：\n\n```shell script\nCUDA_VISIBLE_DEVICES=\u003CGPU_ID> python main.py --base configs\u002Flatent-diffusion\u002F\u003Cconfig_spec>.yaml -t --gpus 0,\n``` \n\n其中`\u003Cconfig_spec>`可取{`celebahq-ldm-vq-4`(f=4，VQ正则化自编码器，空间尺寸64x64x3)，`ffhq-ldm-vq-4`(f=4，VQ正则化自编码器，空间尺寸64x64x3),\n`lsun_bedrooms-ldm-vq-4`(f=4，VQ正则化自编码器，空间尺寸64x64x3),\n`lsun_churches-ldm-vq-4`(f=8，KL正则化自编码器，空间尺寸32x32x4)，`cin-ldm-vq-8`(f=8，VQ正则化自编码器，空间尺寸32x32x4)}。\n\n# 模型库\n\n## 预训练自编码模型\n![rec2](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCompVis_latent-diffusion_readme_5525ca4033eb.png)\n\n所有模型均训练至收敛（rFID不再有显著提升）。\n\n| 模型                   | rFID vs val | 训练步数           | PSNR           | PSIM          | 链接                                                                                                                                                  | 备注              \n|-------------------------|------------|----------------|----------------|---------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------|\n| f=4, VQ (Z=8192, d=3)   | 0.58       | 533066 | 27.43  +\u002F- 4.26 | 0.53 +\u002F- 0.21 |     https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fvq-f4.zip                   |  |\n| f=4, VQ (Z=8192, d=3)   | 1.06       | 658131 | 25.21 +\u002F-  4.17 | 0.72 +\u002F- 0.26 | https:\u002F\u002Fheibox.uni-heidelberg.de\u002Ff\u002F9c6681f64bb94338a069\u002F?dl=1  | 无注意力机制          |\n| f=8, VQ (Z=16384, d=4)  | 1.14       | 971043 | 23.07 +\u002F- 3.99 | 1.17 +\u002F- 0.36 |       https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fvq-f8.zip                     |  
                     |\n| f=8, VQ (Z=256, d=4)    | 1.49       | 1608649 | 22.35 +\u002F- 3.81 | 1.26 +\u002F- 0.37 |   https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fvq-f8-n256.zip |  \n| f=16, VQ (Z=16384, d=8) | 5.15       | 1101166 | 20.83 +\u002F- 3.61 | 1.73 +\u002F- 0.43 |             https:\u002F\u002Fheibox.uni-heidelberg.de\u002Ff\u002F0e42b04e2e904890a9b6\u002F?dl=1                        |                       |\n|                         |            |  |                |               |                                                                                                                                                    |                       |\n| f=4, KL                 | 0.27       | 176991 | 27.53 +\u002F- 4.54 | 0.55 +\u002F- 0.24 |     https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fkl-f4.zip                                   |                       |\n| f=8, KL                 | 0.90       | 246803 | 24.19 +\u002F- 4.19 | 1.02 +\u002F- 0.35 |             https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fkl-f8.zip                            |                       |\n| f=16, KL     (d=16)     | 0.87       | 442998 | 24.08 +\u002F- 4.22 | 1.07 +\u002F- 0.36 |      https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fkl-f16.zip                                  |                       |\n | f=32, KL     (d=64)     | 2.04       | 406763 | 22.27 +\u002F- 3.93 | 1.41 +\u002F- 0.40 |             https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fkl-f32.zip                            |                       |\n\n### 获取模型\n\n运行以下脚本可下载并解压所有可用的预训练自编码模型。   \n```shell script\nbash scripts\u002Fdownload_first_stages.sh\n```\n\n第一阶段模型随后可在 `models\u002Ffirst_stage_models\u002F\u003Cmodel_spec>` 中找到。\n\n\n\n## 预训练 LDMs\n| 数据集                          | 任务    | 模型        | FID           | IS              | 精确率 | 召回率 | 链接                                 
                                                                                                                                                  | 备注                                        \n|---------------------------------|------|--------------|---------------|-----------------|------|------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|\n| CelebA-HQ                       | 无条件图像合成    |  LDM-VQ-4 (200 步 DDIM，eta=0)| 5.11 (5.11)          | 3.29            | 0.72    | 0.49 |    https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fceleba.zip     |                                                 |  \n| FFHQ                            | 无条件图像合成    |  LDM-VQ-4 (200 步 DDIM，eta=1)| 4.98 (4.98)  | 4.50 (4.50)   | 0.73 | 0.50 |              https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fffhq.zip                                              |                                                 |\n| LSUN-Churches                   | 无条件图像合成   |  LDM-KL-8 (400 步 DDIM，eta=0)| 4.02 (4.02) | 2.72 | 0.64 | 0.52 |         https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Flsun_churches.zip        |                                                 |  \n| LSUN-Bedrooms                   | 无条件图像合成   |  LDM-VQ-4 (200 步 DDIM，eta=1)| 2.95 (3.0)          | 2.22 (2.23)| 0.66 | 0.48 | https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Flsun_bedrooms.zip |                                                 |  \n| ImageNet                        | 类别条件图像合成 | LDM-VQ-8 (200 步 DDIM，eta=1) | 7.77(7.76)* \u002F15.82** | 201.56(209.52)* \u002F78.82** | 0.84* \u002F 0.65** | 0.35* \u002F 0.63** |   https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fcin.zip                                                                   | 
*: 使用引导，classifier_scale 10  **: 无引导，括号内分数由 [ADM](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fguided-diffusion) 提供的脚本计算 |   \n| Conceptual Captions             | 文本条件图像合成 | LDM-VQ-f4 (100 步 DDIM，eta=0) | 16.79         | 13.89           | N\u002FA | N\u002FA |              https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Ftext2img.zip                                | 基于 LAION 微调                            |   \n| OpenImages                      | 超分辨率   | LDM-VQ-4     | N\u002FA            | N\u002FA               | N\u002FA    | N\u002FA    |                                    https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fsr_bsr.zip                                    | BSR 图像退化                           |\n| OpenImages                      | 布局到图像合成    | LDM-VQ-4 (200 步 DDIM，eta=0) | 32.02         | 15.92           | N\u002FA    | N\u002FA    |                  https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Flayout2img_model.zip                                           |                                                 | \n| Landscapes      | 语义图像合成   | LDM-VQ-4  | N\u002FA             | N\u002FA               | N\u002FA    | N\u002FA    |           https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fsemantic_synthesis256.zip                                    |                                                 |\n| Landscapes       | 语义图像合成   | LDM-VQ-4  | N\u002FA             | N\u002FA               | N\u002FA    | N\u002FA    |           https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fsemantic_synthesis.zip                                    |             在 512x512 分辨率下微调                                     |\n\n\n### 获取模型\n\n可通过以下命令联合下载并解压上述 LDMs：\n\n```shell script\nbash scripts\u002Fdownload_models.sh\n```\n\n模型随后可在 `models\u002Fldm\u002F\u003Cmodel_spec>` 中找到。\n\n## 即将推出...\n\n* 更多用于条件扩散模型的推理脚本。\n* 同时，您也可以试用我们的 Colab 
笔记本：https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1xqzUi2iXQXDqXBHQGP9Mqt2YrYW6cx-J?usp=sharing\n\n## 注释\n\n- 我们的扩散模型代码库大量借鉴了 [OpenAI 的 ADM 代码库](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fguided-diffusion) 和 [https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fdenoising-diffusion-pytorch](https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fdenoising-diffusion-pytorch)。感谢开源！\n\n- 变换器编码器的实现来自 [x-transformers](https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fx-transformers)，由 [lucidrains](https:\u002F\u002Fgithub.com\u002Flucidrains?tab=repositories) 提供。\n\n\n## BibTeX\n\n```\n@misc{rombach2021highresolution,\n      title={High-Resolution Image Synthesis with Latent Diffusion Models}, \n      author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},\n      year={2021},\n      eprint={2112.10752},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n\n@misc{https:\u002F\u002Fdoi.org\u002F10.48550\u002Farxiv.2204.11824,\n  doi = {10.48550\u002FARXIV.2204.11824},\n  url = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.11824},\n  author = {Blattmann, Andreas and Rombach, Robin and Oktay, Kaan and Ommer, Björn},\n  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},\n  title = {Retrieval-Augmented Diffusion Models},\n  publisher = {arXiv},\n  year = {2022},  \n  copyright = {arXiv.org perpetual, non-exclusive license}\n}\n\n\n```","# Latent Diffusion Models 快速上手指南\n\nLatent Diffusion Models (LDM) 是一种高效的图像生成模型，支持文生图、图像修复、类条件生成等多种任务。本指南将帮助你快速搭建环境并运行预训练模型。\n\n## 环境准备\n\n### 系统要求\n- **操作系统**: Linux (推荐) 或 macOS\n- **GPU**: 支持 CUDA 的 NVIDIA 显卡（建议显存 ≥ 8GB）\n- **Python**: 3.8+\n- **包管理器**: Conda (推荐)\n\n### 前置依赖\n确保已安装 [Conda](https:\u002F\u002Fconda.io\u002F)。国内用户建议使用 [清华镜像源](https:\u002F\u002Fmirrors.tuna.tsinghua.edu.cn\u002Fhelp\u002Fanaconda\u002F) 加速下载。\n\n## 安装步骤\n\n1. **克隆仓库**\n   ```bash\n   git clone https:\u002F\u002Fgithub.com\u002FCompVis\u002Flatent-diffusion.git\n   cd latent-diffusion\n   ```\n\n2. 
**创建并激活 Conda 环境**\n   ```bash\n   conda env create -f environment.yaml\n   conda activate ldm\n   ```\n   > **提示**: 若依赖包下载缓慢，可手动编辑 `environment.yaml`，将 `pip` 源替换为国内镜像（如 `https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple`）。\n\n3. **下载预训练模型权重**\n   \n   以**文生图 (Text-to-Image)** 模型为例：\n   ```bash\n   mkdir -p models\u002Fldm\u002Ftext2img-large\u002F\n   wget -O models\u002Fldm\u002Ftext2img-large\u002Fmodel.ckpt https:\u002F\u002Fommer-lab.com\u002Ffiles\u002Flatent-diffusion\u002Fnitro\u002Ftxt2img-f8-large\u002Fmodel.ckpt\n   ```\n   > **注意**: 其他任务（如图像修复 Inpainting、检索增强 RDM）需下载对应的权重文件，请参考原文档 \"Pretrained Models\" 章节获取相应链接。\n\n## 基本使用\n\n### 文生图 (Text-to-Image)\n\n运行以下命令，根据文本提示生成图像：\n\n```bash\npython scripts\u002Ftxt2img.py --prompt \"a virus monster is playing guitar, oil on canvas\" --ddim_eta 0.0 --n_samples 4 --n_iter 4 --scale 5.0 --ddim_steps 50\n```\n\n**参数说明：**\n- `--prompt`: 生成图像的文本描述。\n- `--n_samples`: 每次迭代生成的样本数。\n- `--n_iter`: 迭代次数。\n- `--scale`: 引导尺度（越高通常样本质量越好，但多样性降低；示例中取 5.0）。\n- `--ddim_steps`: 采样步数（越高质量通常越好，超过 250 步后收益递减；示例中取 50）。\n- `--ddim_eta`: DDIM 的 eta 参数；取 0.0 时可在较少采样步数下保持较好的质量。\n\n生成结果默认保存在 `outputs\u002Ftxt2img-samples` 目录下，包含单张图片和大小为 `n_iter` x `n_samples` 的拼接网格图。\n\n### 进阶技巧：超出训练分辨率的生成\n尝试生成大于训练分辨率（256x256）的图像（如 384x1024）：\n```bash\npython scripts\u002Ftxt2img.py --prompt \"a sunset behind a mountain range, vector image\" --ddim_eta 1.0 --n_samples 1 --n_iter 1 --H 384 --W 1024 --scale 5.0\n```\n*注：`H` 和 `W` 会被整除 8 以计算潜在空间尺寸，建议取 8 的倍数；与 256x256 相比，可控性会有所降低。*\n\n### 图像修复 (Inpainting)\n1. 准备图片 `image.png` 和对应的掩码 `image_mask.png`（白色区域为重绘区），放入 `data\u002Finpainting_examples\u002F`。\n2. 
运行：\n   ```bash\n   python scripts\u002Finpaint.py --indir data\u002Finpainting_examples\u002F --outdir outputs\u002Finpainting_results\n   ```","一家独立游戏工作室的美术团队正急需为奇幻题材的新项目批量生成高分辨率的概念原画，以加速前期视觉探索。\n\n### 没有 latent-diffusion 时\n- **显存门槛极高**：直接生成高分辨率图像需要巨大的 GPU 显存，团队昂贵的计算资源经常因内存溢出而崩溃，无法流畅运行大模型。\n- **细节模糊失真**：受限于算力，只能先生成小图再强行放大，导致画面出现严重的伪影和模糊，无法满足专业美术标准。\n- **创作效率低下**：手动绘制多版草图耗时数天，且难以快速响应策划对“特定风格（如油画质感）”的反复修改需求。\n- **风格迁移困难**：缺乏有效的检索增强机制，难以精准参考现有素材库中的构图或色调，导致产出风格不统一。\n\n### 使用 latent-diffusion 后\n- **低显存高效运行**：通过在潜在空间（Latent Space）而非像素空间进行扩散计算，显著降低了显存占用，使普通显卡也能生成高清大图。\n- **原生高清画质**：直接合成高分辨率图像，保留了丰富的纹理细节和清晰的边缘，无需后期超分处理即可用于概念设计。\n- **文本精准控制**：利用强大的文本到图像能力，输入如“一只读报纸的快乐熊，油画风格”即可秒级生成多版高质量方案，大幅缩短迭代周期。\n- **检索增强生成**：借助检索增强扩散模型（RDM）功能，可结合 CLIP 嵌入检索相似参考图，确保生成内容在构图和风格上与项目设定高度一致。\n\nlatent-diffusion 通过将扩散过程压缩至潜在空间，彻底打破了高分辨率图像生成的算力瓶颈，让中小团队也能以低成本实现电影级的视觉创作。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCompVis_latent-diffusion_898c1939.png","CompVis","CompVis - Computer Vision and Learning LMU Munich","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FCompVis_364aef2c.png","Computer Vision and Learning research group at Ludwig Maximilian University of Munich (formerly Computer Vision Group at Heidelberg University)",null,"assist.mvl@lrz.uni-muenchen.de","https:\u002F\u002Fommer-lab.com\u002F","https:\u002F\u002Fgithub.com\u002FCompVis",[85,89,93],{"name":86,"color":87,"percentage":88},"Jupyter Notebook","#DA5B0B",90.5,{"name":90,"color":91,"percentage":92},"Python","#3572A5",9.4,{"name":94,"color":95,"percentage":96},"Shell","#89e051",0.1,13957,1720,"2026-04-05T23:38:54","MIT","Linux","必需 NVIDIA GPU (通过 CUDA_VISIBLE_DEVICES 环境变量控制)，具体显存大小未说明（大模型如 1.45B 参数及高分辨率生成通常建议 8GB+），CUDA 版本未说明","未说明 (检索数据库下载需额外磁盘空间，OpenImages 索引约 21GB)",{"notes":105,"python":106,"dependencies":107},"1. 必须使用 conda 创建名为 'ldm' 的环境并安装 environment.yaml 中的依赖。2. 针对不同功能（如检索增强扩散模型 RDM），需额外安装特定版本的 Python 包。3. 首次运行需手动下载预训练权重文件（文生图模型约 5.7GB，修复模型及其他变体大小不一）。4. 若使用 OpenImages 数据库进行检索，需预先运行脚本构建搜索索引（约 21GB）。5. 
代码主要通过命令行脚本（如 txt2img.py, inpaint.py）运行。","未说明 (依赖 conda 环境配置文件 environment.yaml)",[108,109,110,111,112,113,114,115,116,117],"torch","transformers==4.19.2","scann","kornia==0.6.4","torchmetrics==0.6.0","einops","pytorch-lightning","omegaconf","taming-transformers-rom1504","clip",[14],"2026-03-27T02:49:30.150509","2026-04-06T14:01:28.388376",[122,127,132,136,141,146],{"id":123,"question_zh":124,"answer_zh":125,"source_url":126},19115,"从头训练自编码器（AE）并与 UNet 结合进行超分辨率任务时效果不佳，可能是什么原因？","这通常与模型类类型不匹配或训练配置有关。请检查以下几点：\n1. 确认自编码器的类类型：如果使用预训练的 UNet 配合标准配置文件，它可能期望加载 `VQModelInterface` 类的自编码器，而不是 `AutoencoderKL`。尝试将你的模型改为 `VQModelInterface` 类型。\n2. 训练资源与精度：有用户成功使用单张 GPU（如 A10G）配合混合精度（16-bit float）进行训练，每个部分（AE 和 LDM）训练约 100 个 epoch（约 10 万步），耗时约 3 天。如果数据是 3D 的，计算成本高，建议每批次只训练 1 张图像以优化显存。\n3. 避免分片训练：有反馈表明在 4 张 GPU 上进行分片训练效果不如单卡训练好。","https:\u002F\u002Fgithub.com\u002FCompVis\u002Flatent-diffusion\u002Fissues\u002F270",{"id":128,"question_zh":129,"answer_zh":130,"source_url":131},19116,"训练图像修复（Inpainting）模型时损失收敛过快且生成质量差，如何解决？","这个问题通常由学习率设置不当或训练时间不足引起。建议采取以下措施：\n1. 设置无条件概率（Unconditional Probability）：在训练时以一定概率（如 10%）将文本条件设为空，以确保模型具备无条件生成能力，这对于采样时使用 Classifier-Free Guidance 至关重要。\n2. 调整学习率（LR）：学习率需要根据具体数据集进行调整。参考官方论文补充材料中的超参数设置（通常在 Places2 数据集上实验）。注意 `scale_lr` 参数，它会根据使用的显卡数量平均分配学习率。\n3. 延长训练时间：图像修复任务可能需要较长的训练周期，建议训练一周以上。作为参考，在 CelebA 人脸数据集上的无条件生成任务大约需要 4 天才能训好。","https:\u002F\u002Fgithub.com\u002FCompVis\u002Flatent-diffusion\u002Fissues\u002F159",{"id":133,"question_zh":134,"answer_zh":135,"source_url":131},19117,"在使用 Classifier-Free Guidance 进行采样时，如何正确设置 `unconditional_conditioning` 参数？","在采样过程中，如果仅调整 `unconditional_guidance_scale` 而将 `unconditional_conditioning` 设为 `none`，可能无法生效。具体设置取决于条件类型：\n1. 对于类别条件（Class Condition）：参考 `latent-imagenet-diffusion.ipynb`，可以将类别数字设置为超出范围的数值（例如最大类别数 +1），然后将其转换为 embedding 传给 `unconditional_conditioning`。\n2. 
对于图像或文本条件：目前社区仍在探讨最佳实践，通常需要构造一个对应的“空”或“零”条件的 embedding 传入，而不是直接设为 None。确保在推理配置中正确加载了支持无条件采样的配置。",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},19118,"为什么使用提供的预训练模型复现 LSUN-Churches 或 CelebA 数据集的 FID 分数时，结果远差于论文报告值？","FID 分数差异巨大通常由以下两个原因导致：\n1. 生成图像数量不足：论文中报告的结果是基于生成 50,000 张图像计算的。如果只生成 10,000 张（为了节省时间），FID 分数会显著变差（例如从 4.02 变为 11.5）。务必生成足够数量的图像（50k）进行评估。\n2. 推理配置错误：确保使用了正确的推理配置文件（inference config）。不同的采样步数（如 400 步 DDIM）和 eta 值也会影响结果，请严格对照仓库中的默认设置。","https:\u002F\u002Fgithub.com\u002FCompVis\u002Flatent-diffusion\u002Fissues\u002F213",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},19119,"计算 FID 分数时，生成图像的数量和格式有什么具体要求？","为了获得准确且可比的 FID 分数：\n1. 数量一致性：虽然训练集数据量远大于 50k，但在计算 FID 时，通常建议生成图像的数量与评估用的真实图像子集数量保持一致（通常为 50k）。数据量越大，FID 统计越准确。\n2. 图像格式：数据集原始格式可能是 webp，而生成图像通常是 png。在计算 FID 时，工具（如 torch-fidelity）通常能处理不同格式，但最好确保输入目录中的图像格式统一，或者在预处理时将数据集转换为与生成图像相同的格式（如 png），以避免潜在的解码差异。","https:\u002F\u002Fgithub.com\u002FCompVis\u002Flatent-diffusion\u002Fissues\u002F253",{"id":147,"question_zh":148,"answer_zh":149,"source_url":126},19120,"如何在有限的显存下训练 3D 数据的潜扩散模型（LDM）和自编码器（AE）？","针对高计算成本的 3D 数据，可以采用以下策略在单卡上训练：\n1. 使用混合精度训练：启用 16-bit 浮点数（混合精度）代替 32-bit，可显著降低显存占用。\n2. 减小 Batch Size：将每批次的图像数量减少到 1 张。\n3. 硬件选择：有成功案例显示使用单张 A10G GPU 即可完成训练。\n4. 训练时长：在这种配置下，AE 和 LDM 各需训练约 100 个 epoch（约 10 万步），总耗时约为 3 天。避免使用多卡分片训练，因为有反馈表明其效果不如单卡稳定。",[]]