[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-TencentQQGYLab--ELLA":3,"tool-TencentQQGYLab--ELLA":64},[4,17,26,40,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,2,"2026-04-03T11:11:01",[13,14,15],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":23,"last_commit_at":32,"category_tags":33,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,34,35,36,15,37,38,13,39],"数据工具","视频","插件","其他","语言模型","音频",{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":10,"last_commit_at":46,"category_tags":47,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,38,37],{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":10,"last_commit_at":54,"category_tags":55,"status":16},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74939,"2026-04-05T23:16:38",[38,14,13,37],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":23,"last_commit_at":62,"category_tags":63,"status":16},2471,"tesseract","tesseract-ocr\u002Ftesseract","Tesseract 
是一款历史悠久且备受推崇的开源光学字符识别（OCR）引擎，最初由惠普实验室开发，后由 Google 维护，目前由全球社区共同贡献。它的核心功能是将图片中的文字转化为可编辑、可搜索的文本数据，有效解决了从扫描件、照片或 PDF 文档中提取文字信息的难题，是数字化归档和信息自动化的重要基础工具。\n\n在技术层面，Tesseract 展现了强大的适应能力。从版本 4 开始，它引入了基于长短期记忆网络（LSTM）的神经网络 OCR 引擎，显著提升了行识别的准确率；同时，为了兼顾旧有需求，它依然支持传统的字符模式识别引擎。Tesseract 原生支持 UTF-8 编码，开箱即用即可识别超过 100 种语言，并兼容 PNG、JPEG、TIFF 等多种常见图像格式。输出方面，它灵活支持纯文本、hOCR、PDF、TSV 等多种格式，方便后续数据处理。\n\nTesseract 主要面向开发者、研究人员以及需要构建文档处理流程的企业用户。由于它本身是一个命令行工具和库（libtesseract），不包含图形用户界面（GUI），因此最适合具备一定编程能力的技术人员集成到自动化脚本或应用程序中",73286,"2026-04-03T01:56:45",[13,14],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":75,"owner_avatar_url":76,"owner_bio":77,"owner_company":77,"owner_location":77,"owner_email":77,"owner_twitter":77,"owner_website":77,"owner_url":78,"languages":79,"stars":92,"forks":93,"last_commit_at":94,"license":95,"difficulty_score":10,"env_os":96,"env_gpu":97,"env_ram":96,"env_deps":98,"category_tags":105,"github_topics":77,"view_count":23,"oss_zip_url":77,"oss_zip_packed_at":77,"status":16,"created_at":106,"updated_at":107,"faqs":108,"releases":139},2224,"TencentQQGYLab\u002FELLA","ELLA","ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment","ELLA 是一款旨在提升文生图质量的开源项目，其核心理念是将大型语言模型（LLM）的强大语义理解能力赋予扩散模型。传统绘图工具往往难以精准捕捉复杂提示词中的深层含义，导致生成图像与文字描述出现偏差。ELLA 通过引入先进的语言模型作为“大脑”，显著增强了图像生成过程中的语义对齐能力，让画面能更准确、细腻地还原用户脑海中那些抽象或复杂的创意描述。\n\n该项目特别适合 AI 研究人员、开发者以及追求高质量创作效果的设计师使用。对于希望深入探索多模态技术融合的极客，或是需要利用 ComfyUI 等工作流进行专业创作的艺术家，ELLA 提供了丰富的接口和插件支持。其独特亮点在于不仅发布了基于 SD1.5 的优化模型，还推出了便捷的 ComfyUI 插件，让用户能轻松在现有工作流中体验升级后的生成效果。此外，团队后续推出的 EMMA 项目进一步拓展了其处理多模态提示的能力。作为一个处于快速迭代中的前沿研究项目，ELLA 为突破当前文生图技术的语义瓶颈提供了极具价值的解决方案。","# ELLA & EMMA\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Ctd>\n      \u003Ch2> ELLA \u003C\u002Fh2>\n      \u003Cp> Paper: \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.05135\">ELLA: Equip Diffusion Models with LLM for Enhanced 
Semantic Alignment \u003C\u002Fa>\u003C\u002Fp>\n      \u003Cp> Project Website: \u003Ca href=\"https:\u002F\u002Fella-diffusion.github.io\u002F\">ELLA\u003C\u002Fa> \u003C\u002Fp>\n    \u003C\u002Ftd>\n    \u003Ctd>\n      \u003Ch2> EMMA \u003C\u002Fh2>\n      \u003Cp> Paper: \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.09162\">EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts\u003C\u002Fa>\u003C\u002Fp>\n      \u003Cp> Project Website: \u003Ca href=\"https:\u002F\u002Ftencentqqgylab.github.io\u002FEMMA\u002F\">EMMA\u003C\u002Fa> \u003C\u002Fp>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n## ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment\n\n\u003Cdiv align=\"center\">\n\u003Cspan class=\"author-block\">\n    \u003Ca href=\"https:\u002F\u002Fopenreview.net\u002Fprofile?id=~Xiwei_Hu1\">Xiwei Hu*\u003C\u002Fa>,\n\u003C\u002Fspan>\n\u003Cspan class=\"author-block\">\n    \u003Ca href=\"https:\u002F\u002Fwrong.wang\u002F\">Rui Wang*\u003C\u002Fa>,\n\u003C\u002Fspan>\n\u003Cspan class=\"author-block\">\n    \u003Ca href=\"https:\u002F\u002Fopenreview.net\u002Fprofile?id=~Yixiao_Fang1\">Yixiao Fang*\u003C\u002Fa>,\n\u003C\u002Fspan>\n\u003Cspan class=\"author-block\">\n    \u003Ca href=\"https:\u002F\u002Fopenreview.net\u002Fprofile?id=~BIN_FU2\">Bin Fu*\u003C\u002Fa>,\n\u003C\u002Fspan>\n\u003Cspan class=\"author-block\">\n    \u003Ca href=\"https:\u002F\u002Fopenreview.net\u002Fprofile?id=~Pei_Cheng1\">Pei Cheng\u003C\u002Fa>,\n\u003C\u002Fspan>\n\u003Cspan class=\"author-block\">\n    \u003Ca href=\"https:\u002F\u002Fwww.skicyyu.org\u002F\">Gang Yu&#10022\u003C\u002Fa>\n\u003C\u002Fspan>\n\u003Cp>\n* Equal contributions, &#10022 Corresponding Author\n\u003C\u002Fp>\n\n\u003Cimg src=\".\u002Fassets\u002FELLA-Diffusion.jpg\" width=\"30%\" > \u003Cbr\u002F>\n\u003Ca href='https:\u002F\u002Fella-diffusion.github.io\u002F'>\u003Cimg 
src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Page-green'>\u003C\u002Fa>\n\u003Ca href='https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.05135'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2403.05135-b31b1b.svg'>\u003C\u002Fa>\n\u003C\u002Fdiv>\n\nOfficial code of \"ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment\".\n\u003Cp>\n\u003C\u002Fp>\n\u003Cdiv align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_9a2b596c16d1.png\" width=\"100%\">\n    \u003Cimg src=\".\u002Fassets\u002Fteaser1_raccoon.png\" width=\"100%\">\n\u003C\u002Fdiv>\n\n## 🌟 Changelog\n\n- **[2024.6.14]** 🔥🔥 EMMA: [Technical Report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.09162), [Project Website](https:\u002F\u002Ftencentqqgylab.github.io\u002FEMMA\u002F)\n- **[2024.5.13]** EMMA is coming soon. Let's first preview the results of EMMA: [中文版](https:\u002F\u002Fwrong.wang\u002Fblog\u002F20240512-emma\u002F), [English Version](https:\u002F\u002Fwrong.wang\u002Fblog\u002F20240512-what-is-emma\u002F)\n- **[2024.4.19]** We provide ELLA’s ComfyUI plugin: [TencentQQGYLab\u002FComfyUI-ELLA](https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FComfyUI-ELLA)\n- **[2024.4.11]** Add some results of [EMMA(Efficient Multi-Modal Adapter)](#emma)\n- **[2024.4.9]** 🔥🔥🔥 Release [ELLA-SD1.5](https:\u002F\u002Fhuggingface.co\u002FQQGYLab\u002FELLA\u002Fblob\u002Fmain\u002Fella-sd1.5-tsc-t5xl.safetensors) Checkpoint! Welcome to try! \n- **[2024.3.11]** 🔥 Release DPG-Bench! Welcome to try! 
\n- **[2024.3.7]** Initial update\n\n\n## 🚀 Usage\n\n### Download\n\nYou can download ELLA models from [QQGYLab\u002FELLA](https:\u002F\u002Fhuggingface.co\u002FQQGYLab\u002FELLA).\n\n### Quick View\n\n```bash\n# get ELLA-SD1.5 at https:\u002F\u002Fhuggingface.co\u002FQQGYLab\u002FELLA\u002Fblob\u002Fmain\u002Fella-sd1.5-tsc-t5xl.safetensors\n\n# compare ella-sd1.5 with sd1.5\n# images are written to `.\u002Fassets\u002Fella-inference-examples`\npython3 inference.py test --save_folder .\u002Fassets\u002Fella-inference-examples --ella_path \u002Fpath\u002Fto\u002Fella-sd1.5-tsc-t5xl.safetensors\n```\n\n### Build a demo for comparing SD1.5 and ELLA-SD1.5\n\n```bash\nGRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=8082 python3 .\u002Finference.py demo \u002Fpath\u002Fto\u002Fella-sd1.5-tsc-t5xl.safetensors\n```\n\n### Using ELLA in ComfyUI\n\nWe provide ELLA’s ComfyUI plugin: [TencentQQGYLab\u002FComfyUI-ELLA](https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FComfyUI-ELLA), which supports ControlNet, img2img and more. You are welcome to try it out.\n\n\nThanks to [@ExponentialML](https:\u002F\u002Fgithub.com\u002FExponentialML\u002F) and [@kijai](https:\u002F\u002Fgithub.com\u002Fkijai), who offer third-party ComfyUI plugins for ELLA:\n\n1. [ExponentialML\u002FComfyUI_ELLA](https:\u002F\u002Fgithub.com\u002FExponentialML\u002FComfyUI_ELLA\u002F)\n2. [kijai\u002FComfyUI-ELLA-wrapper](https:\u002F\u002Fgithub.com\u002Fkijai\u002FComfyUI-ELLA-wrapper)\n\n## 📙 Notes\n\nELLA is still in its early stages of research, and we have not yet conducted comprehensive testing on all potential applications of ELLA. We welcome constructive and friendly suggestions from the community.\n\nHere, we share some tips that we have discovered thus far on how to better utilize ELLA:\n\n### 1. Caption Upscale\n\nELLA was trained using MLLM-annotated synthetic captions. 
As mentioned in [Improving Image Generation with Better Captions](https:\u002F\u002Fcdn.openai.com\u002Fpapers\u002Fdall-e-3.pdf), performing an \"upsampling\" on the input caption before using ELLA can extract its maximum potential.\n\nWe have discovered that the In-Context Learning (ICL) capability of LLMs can serve as a straightforward caption upsampler:\n\nExample instruction:\n\n```\nPlease generate the long prompt version of the short one according to the given examples. Long prompt version should consist of 3 to 5 sentences. Long prompt version must specify the color, shape, texture or spatial relation of the included objects. DO NOT generate sentences that describe any atmosphere!!!\n\nShort: A calico cat with eyes closed is perched upon a Mercedes.\nLong: a multicolored cat perched atop a shiny black car. the car is parked in front of a building with wooden walls and a green fence. the reflection of the car and the surrounding environment can be seen on the car's glossy surface.\n\nShort: A boys sitting on a chair holding a video game remote.\nLong: a young boy sitting on a chair, wearing a blue shirt and a baseball cap with the letter 'm'. he has a red medal around his neck and is holding a white game controller. behind him, there are two other individuals, one of whom is wearing a backpack. to the right of the boy, there's a blue trash bin with a sign that reads 'automatic party'.\n\nShort: A man is on the bank of the water fishing.\nLong: a serene waterscape where a person, dressed in a blue jacket and a red beanie, stands in shallow waters, fishing with a long rod. the calm waters are dotted with several sailboats anchored at a distance, and a mountain range can be seen in the background under a cloudy sky.\n\nShort: A kitchen with a cluttered counter and wooden cabinets.\nLong: a well-lit kitchen with wooden cabinets, a black and white checkered floor, and a refrigerator adorned with a floral decal on its side. 
the kitchen countertop holds various items, including a coffee maker, jars, and fruits.\n\nShort: a racoon holding a shiny red apple over its head\n```\n\nusing: https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FQwen\u002FQwen-72B-Chat-Demo\n\nwe got: \n\na mischievous raccoon standing on its hind legs, holding a bright red apple aloft in its furry paws. the apple shines brightly against the backdrop of a dense forest, with leaves rustling in the gentle breeze. a few scattered rocks can be seen on the ground beneath the raccoon's feet, while a gnarled tree trunk stands nearby.\n\n\n#### Before and After caption upsampling \n\n\noriginal prompt: *a racoon holding a shiny red apple over its head*\n\n| SD1.5 | ELLA-SD1.5_fixed_token_length | ELLA-SD1.5_flexible_token_length |\n| ----- | ----------------------------- | -------------------------------- |\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_2ab773569540.jpg)\n\nQwen-72B refined caption: *a mischievous raccoon standing on its hind legs, holding a bright red apple aloft in its furry paws. the apple shines brightly against the backdrop of a dense forest, with leaves rustling in the gentle breeze. 
a few scattered rocks can be seen on the ground beneath the raccoon's feet, while a gnarled tree trunk stands nearby.*\n\n\n| SD1.5 | ELLA-SD1.5_fixed_token_length | ELLA-SD1.5_flexible_token_length |\n| ----- | ----------------------------- | -------------------------------- |\n\n![](.\u002Fassets\u002Fella-sd1.5-notes\u002Fracoon_apple_Qwen-72B-Chat-refined.jpg)\n\n\n\noriginal prompt: *Crocodile in a sweater*\n\n| SD1.5 | ELLA-SD1.5_fixed_token_length | ELLA-SD1.5_flexible_token_length |\n| ----- | ----------------------------- | -------------------------------- |\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_87c9ec761e68.jpg)\n\nGPT4 refined caption: *a large, textured green crocodile lying comfortably on a patch of grass with a cute, knitted orange sweater enveloping its scaly body. Around its neck, the sweater features a whimsical pattern of blue and yellow stripes. In the background, a smooth, grey rock partially obscures the view of a small pond with lily pads floating on the surface.*\n\n\n| SD1.5 | ELLA-SD1.5_fixed_token_length | ELLA-SD1.5_flexible_token_length |\n| ----- | ----------------------------- | -------------------------------- |\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_12e9e8913099.jpg)\n\n\n### 2. flexible token length\n\nDuring the training of ELLA, long synthetic captions were utilized, with the maximum number of tokens set to 128. When testing ELLA with short captions, in addition to the previously mentioned caption upsampling technique, the \"flexible_token_length\" trick can also be employed. This involves setting the tokenizer's `max_length` as `None`, thereby eliminating any text token padding or truncation. We have observed that this trick can help improve the quality of generated images corresponding to short captions.\n\n### 3. 
ELLA+CLIP for community models\n\nOur testing has revealed that some community models heavily reliant on trigger words may experience significant style loss when utilizing ELLA, primarily because CLIP is not used at all during ELLA inference.\n\nAlthough CLIP was not used during training, we have discovered that it is still possible to concatenate ELLA's input with CLIP's output during inference (Bx77x768 + Bx64x768 -> Bx141x768) as a condition for the UNet. We anticipate that using ELLA in conjunction with CLIP will better integrate with the existing community ecosystem, particularly with CLIP-specific techniques such as Textual Inversion and Trigger Word.\n\nOur goal is to ensure better compatibility with a wider range of community models; however, we do not yet have comprehensive experience to share. If you have any suggestions, we would be grateful if you could share them in an issue.\n\n### 4. FlanT5 must run in fp16 mode.\n\nAs described in [issues#23](https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FELLA\u002Fissues\u002F23), we conducted the vast majority of experiments on V100, which does not support bf16, so we had to use the fp16 T5 for training. We tested and found that the output difference between the fp16 T5 and the bf16 T5 cannot be ignored, resulting in obvious differences in the generated images.\nTherefore, it is recommended to use fp16 T5 for inference.\n\n## 📊 DPG-Bench\n\nGuidelines for DPG-Bench:\n\n1. Generate your images according to our [prompts](.\u002Fdpg_bench\u002Fprompts\u002F).\n\n    It is recommended to generate 4 images per prompt and arrange them in a 2x2 grid. **Please make sure each generated image's filename is the same as the prompt's filename.**\n\n2. 
Run the following command to conduct evaluation.\n\n    ```bash\n    bash dpg_bench\u002Fdist_eval.sh $YOUR_IMAGE_PATH $RESOLUTION\n    ```\n\nThanks to the excellent work of [DSG](https:\u002F\u002Fgithub.com\u002Fj-min\u002FDSG) sincerely, we follow their instructions to generate questions and answers of DPG-Bench.\n\n\u003Ca id=\"emma\">\u003C\u002Fa>\n## 🚧 EMMA - Efficient Multi-Modal Adapter (Work in progress)\n\nAs described in the conclusion section of ELLA's paper  and [issue#15](https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FELLA\u002Fissues\u002F15),\nwe plan to investigate the integration of\nMLLM with diffusion models, enabling the utilization of interleaved image-text input as a conditional component in the image generation process. Here are some very early results with EMMA-SD1.5, stay tuned.\n\n\u003Ctable>\n\u003Cthead>\n  \u003Ctr>\n    \u003Cth>prompt\u003C\u002Fth>\n    \u003Cth>object image\u003C\u002Fth>\n    \u003Cth>results\u003C\u002Fth>\n  \u003C\u002Ftr>\n\u003C\u002Fthead>\n\u003Ctbody>\n  \u003Ctr>\n    \u003Ctd>A woman is skiing down a snowy mountain, wearing a bright orange ski suit and goggles.\u003C\u002Ftd>\n    \u003Ctd rowspan=\"3\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_8bb1cd4e18d9.jpg\" width=\"100%\">\u003C\u002Ftd>\n    \u003Ctd>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_6328f1f3137e.jpg\" width=\"100%\">\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>A woman is playing basketball on an outdoor court, wearing a sleeveless jersey.\u003C\u002Ftd>\n    \u003Ctd>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_cdeb0223e5b1.jpg\" width=\"100%\">\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>A woman is hiking through a dense forest, wearing a green camouflage jacket and carrying a backpack.\u003C\u002Ftd>\n    \u003Ctd>\u003Cimg 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_65a68122854e.jpg\" width=\"100%\">\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>a  dog jumping over a vehicle on a snowy day\u003C\u002Ftd>\n    \u003Ctd rowspan=\"2\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_273912edab25.jpg\" width=\"100%\">\u003C\u002Ftd>\n    \u003Ctd>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_429baf886ec7.jpg\" width=\"100%\">\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>a  dog reading a book with a pink glasses on\u003C\u002Ftd>\n    \u003Ctd>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_b2b69b58379e.jpg\" width=\"100%\">\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>A dog standing on a mountaintop, surveying the stunning view. Snow-capped peaks stretch out in the distance, and a river winds its way through the valley below.\u003C\u002Ftd>\n    \u003Ctd>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_25f97282b2de.jpg\" width=\"100%\">\u003C\u002Ftd>\n    \u003Ctd>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_b70bba34a674.jpg\" width=\"100%\">\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n\n## 📝 TODO\n\n- [x] release checkpoint\n- [x] release inference code\n- [x] release DPG-Bench\n\n\n## 💡 Others\n\nWe have also found [LaVi-Bridge](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.07860), another independent but similar work completed almost concurrently, which offers additional insights not covered by ELLA. The difference between ELLA and LaVi-Bridge can be found in [issue 13](https:\u002F\u002Fgithub.com\u002FELLA-Diffusion\u002FELLA\u002Fissues\u002F13). 
We are delighted to welcome other researchers and community users to promote the development of this field.\n\n## 😉 Citation\n\nIf you find **ELLA** useful for your research and applications, please cite us using this BibTeX:\n\n```\n@misc{hu2024ella,\n      title={ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment}, \n      author={Xiwei Hu and Rui Wang and Yixiao Fang and Bin Fu and Pei Cheng and Gang Yu},\n      year={2024},\n      eprint={2403.05135},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n```\n","# ELLA & EMMA\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Ctd>\n      \u003Ch2> ELLA \u003C\u002Fh2>\n      \u003Cp> 论文: \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.05135\">ELLA: 为扩散模型配备大语言模型以增强语义对齐\u003C\u002Fa>\u003C\u002Fp>\n      \u003Cp> 项目官网: \u003Ca href=\"https:\u002F\u002Fella-diffusion.github.io\u002F\">ELLA\u003C\u002Fa> \u003C\u002Fp>\n    \u003C\u002Ftd>\n    \u003Ctd>\n      \u003Ch2> EMMA \u003C\u002Fh2>\n      \u003Cp> 论文: \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.09162\">EMMA: 您的文本到图像扩散模型可以秘密地接受多模态提示\u003C\u002Fa>\u003C\u002Fp>\n      \u003Cp> 项目官网: \u003Ca href=\"https:\u002F\u002Ftencentqqgylab.github.io\u002FEMMA\u002F\">EMMA\u003C\u002Fa> \u003C\u002Fp>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n## ELLA: 为扩散模型配备大语言模型以增强语义对齐\n\n\u003Cdiv align=\"center\">\n\u003Cspan class=\"author-block\">\n    \u003Ca href=\"https:\u002F\u002Fopenreview.net\u002Fprofile?id=~Xiwei_Hu1\">胡熙伟*\u003C\u002Fa>,\n\u003C\u002Fspan>\n\u003Cspan class=\"author-block\">\n    \u003Ca href=\"https:\u002F\u002Fwrong.wang\u002F\">王睿*\u003C\u002Fa>,\n\u003C\u002Fspan>\n\u003Cspan class=\"author-block\">\n    \u003Ca href=\"https:\u002F\u002Fopenreview.net\u002Fprofile?id=~Yixiao_Fang1\">方一啸*\u003C\u002Fa>,\n\u003C\u002Fspan>\n\u003Cspan class=\"author-block\">\n    \u003Ca 
href=\"https:\u002F\u002Fopenreview.net\u002Fprofile?id=~BIN_FU2\">傅斌*\u003C\u002Fa>,\n\u003C\u002Fspan>\n\u003Cspan class=\"author-block\">\n    \u003Ca href=\"https:\u002F\u002Fopenreview.net\u002Fprofile?id=~Pei_Cheng1\">程培\u003C\u002Fa>,\n\u003C\u002Fspan>\n\u003Cspan class=\"author-block\">\n    \u003Ca href=\"https:\u002F\u002Fwww.skicyyu.org\u002F\">于刚&#10022\u003C\u002Fa>\n\u003C\u002Fspan>\n\u003Cp>\n* 共同第一作者，&#10022 通讯作者\n\u003C\u002Fp>\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_401e70bc3dd5.jpg\" width=\"30%\" > \u003Cbr\u002F>\n\u003Ca href='https:\u002F\u002Fella-diffusion.github.io\u002F'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Page-green'>\u003C\u002Fa>\n\u003Ca href='https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.05135'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2403.05135-b31b1b.svg'>\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n“ELLA: 为扩散模型配备大语言模型以增强语义对齐”的官方代码。\n\u003Cp>\n\u003C\u002Fp>\n\u003Cdiv align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_9a2b596c16d1.png\" width=\"100%\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_496b5158c679.png\" width=\"100%\">\n\u003C\u002Fdiv>\n\n## 🌟 更改记录\n\n- **[2024.6.14]** 🔥🔥 EMMA: [技术报告](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.09162), [项目官网](https:\u002F\u002Ftencentqqgylab.github.io\u002FEMMA\u002F)\n- **[2024.5.13]** EMMA 即将发布。让我们先预览一下 EMMA 的成果：[中文版](https:\u002F\u002Fwrong.wang\u002Fblog\u002F20240512-emma\u002F)，[英文版](https:\u002F\u002Fwrong.wang\u002Fblog\u002F20240512-what-is-emma\u002F)\n- **[2024.4.19]** 我们提供了 ELLA 的 ComfyUI 插件：[TencentQQGYLab\u002FComfyUI-ELLA](https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FComfyUI-ELLA)\n- **[2024.4.11]** 添加了一些 [EMMA（高效多模态适配器）](#emma) 的结果\n- **[2024.4.9]** 🔥🔥🔥 发布了 
[ELLA-SD1.5](https:\u002F\u002Fhuggingface.co\u002FQQGYLab\u002FELLA\u002Fblob\u002Fmain\u002Fella-sd1.5-tsc-t5xl.safetensors) 检查点！欢迎大家试用！\n- **[2024.3.11]** 🔥 发布了 DPG-Bench！欢迎大家试用！\n- **[2024.3.7]** 初始更新\n\n\n## 🚀 使用方法\n\n### 下载\n\n您可以从 [QQGYLab\u002FELLA](https:\u002F\u002Fhuggingface.co\u002FQQGYLab\u002FELLA) 下载 ELLA 模型。\n\n### 快速查看\n\n```bash\n# 在 https:\u002F\u002Fhuggingface.co\u002FQQGYLab\u002FELLA\u002Fblob\u002Fmain\u002Fella-sd1.5-tsc-t5xl.safetensors 获取 ELLA-SD1.5\n\n# 比较 ella-sd1.5 和 sd1.5\n# 将在 `.\u002Fassets\u002Fella-inference-examples` 生成图片\npython3 inference.py test --save_folder .\u002Fassets\u002Fella-inference-examples --ella_path \u002Fpath\u002Fto\u002Fella-sd1.5-tsc-t5xl.safetensors\n```\n\n### 构建一个比较 SD1.5 和 ELLA-SD1.5 的演示\n\n```bash\nGRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=8082 python3 .\u002Finference.py demo \u002Fpath\u002Fto\u002Fella-sd1.5-tsc-t5xl.safetensors\n```\n\n### 在 ComfyUI 中使用 ELLA\n\n我们提供了 ELLA 的 ComfyUI 插件：[TencentQQGYLab\u002FComfyUI-ELLA](https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FComfyUI-ELLA)，它支持 ControlNet、img2img 等功能。欢迎您尝试使用。\n\n感谢 [@ExponentialML](https:\u002F\u002Fgithub.com\u002FExponentialML\u002F) 和 [@kijai](https:\u002F\u002Fgithub.com\u002Fkijai)，他们为 ELLA 提供了第三方 ComfyUI 插件：\n\n1. [ExponentialML\u002FComfyUI_ELLA](https:\u002F\u002Fgithub.com\u002FExponentialML\u002FComfyUI_ELLA\u002F)\n2. [kijai\u002FComfyUI-ELLA-wrapper](https:\u002F\u002Fgithub.com\u002Fkijai\u002FComfyUI-ELLA-wrapper)\n\n## 📙 注意事项\n\nELLA 目前仍处于研究的早期阶段，我们尚未对 ELLA 的所有潜在应用进行全面测试。我们欢迎社区提出建设性的友好建议。\n\n在此，我们分享一些目前发现的关于如何更好地利用 ELLA 的技巧：\n\n### 1. 
文本描述升级\n\nELLA 是使用 MLLM 标注的合成文本描述进行训练的。正如 [通过更好的文本描述提升图像生成](https:\u002F\u002Fcdn.openai.com\u002Fpapers\u002Fdall-e-3.pdf) 中所述，在使用 ELLA 之前对输入文本描述进行“上采样”处理，可以充分发挥其潜力。\n\n我们发现，利用大语言模型的上下文学习（ICL）能力，可以轻松实现一个文本描述升级器：\n\n示例指令：\n\n```\n请根据给定的示例，将简短的提示词扩展为较长的版本。长提示词应由 3 到 5 句话组成。长提示词必须明确描述所包含物体的颜色、形状、纹理或空间关系。切勿生成任何关于氛围的描述！！！\n\n短：一只闭着眼睛的三花猫栖息在一辆梅赛德斯车上。\n长：一只色彩斑斓的猫栖息在一辆闪亮的黑色汽车顶部。这辆车停在一栋木墙建筑前，旁边有一道绿色栅栏。汽车光洁的表面映照出车辆及周围环境的倒影。\n\n短：一个男孩坐在椅子上，手里拿着游戏手柄。\n长：一位身穿蓝色衬衫和带有字母“m”字样的棒球帽的小男孩正坐在椅子上。他脖子上挂着一枚红色奖章，手中握着一个白色的游戏控制器。在他身后还有另外两个人，其中一人背着背包。男孩右侧有一个蓝色的垃圾桶，上面贴着“自动派对”的标志。\n\n短：一名男子在水边钓鱼。\n长：一片宁静的水景，一位身穿蓝色夹克和红色毛线帽的人站在浅水中，用一根长竿钓鱼。平静的水面远处停靠着几艘帆船，背景则是一片云雾缭绕的山脉。\n\n短：一间厨房，台面杂乱，配有木质橱柜。\n长：一间光线充足的厨房，拥有木质橱柜、黑白相间的方格地砖以及侧面贴有花卉贴纸的冰箱。厨房台面上摆放着咖啡机、罐子和水果等物品。\n\n短：一只浣熊高举着一颗闪亮的红苹果\n```\n\n使用：https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FQwen\u002FQwen-72B-Chat-Demo\n\n得到的结果：\n\n一只顽皮的浣熊站立在后腿上，用它毛茸茸的爪子高高举起一颗鲜红的苹果。苹果在茂密森林的背景下熠熠生辉，树叶在微风中沙沙作响。浣熊脚下散落着几块石头，旁边还矗立着一棵粗糙的树干。\n\n\n#### 文本描述升级前后对比\n\n\n原始提示词：*一只浣熊高举着一颗闪亮的红苹果*\n\n| SD1.5 | ELLA-SD1.5_固定token长度 | ELLA-SD1.5_灵活token长度 |\n| ----- | ----------------------------- | -------------------------------- |\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_2ab773569540.jpg)\n\nQwen-72B 优化后的文本描述：*一只顽皮的浣熊站立在后腿上，用它毛茸茸的爪子高高举起一颗鲜红的苹果。苹果在茂密森林的背景下熠熠生辉，树叶在微风中沙沙作响。浣熊脚下散落着几块石头，旁边还矗立着一棵粗糙的树干。*\n\n\n| SD1.5 | ELLA-SD1.5_固定token长度 | ELLA-SD1.5_灵活token长度 |\n| ----- | ----------------------------- | -------------------------------- |\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_f86ea010538f.jpg)\n\n\n\n原始提示词：*穿着毛衣的鳄鱼*\n\n| SD1.5 | ELLA-SD1.5_固定token长度 | ELLA-SD1.5_灵活token长度 |\n| ----- | ----------------------------- | -------------------------------- |\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_87c9ec761e68.jpg)\n\nGPT4 优化后的文本描述：*一只体型巨大的绿色鳄鱼，皮肤布满纹理，舒适地躺在草地上，身上披着一件可爱的橙色针织毛衣。毛衣领口处有着蓝黄相间的趣味条纹图案。背景中，一块光滑的灰色岩石部分遮挡了眼前小池塘的景色，池塘表面漂浮着睡莲叶。*\n\n\n| SD1.5 | 
ELLA-SD1.5_固定token长度 | ELLA-SD1.5_灵活token长度 |\n| ----- | ----------------------------- | -------------------------------- |\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_12e9e8913099.jpg)\n\n\n### 2. 灵活的 token 长度\n\n在 ELLA 的训练过程中，使用了较长的合成文本描述，最大 token 数被设定为 128。当用短文本描述测试 ELLA 时，除了前面提到的文本描述上采样技术外，还可以采用“flexible_token_length”技巧。即把分词器的 `max_length` 设置为 `None`，从而取消任何文本的填充或截断操作。我们观察到，这一技巧有助于提升短文本描述对应的生成图像质量。\n\n### 3. ELLA+CLIP 用于社区模型\n\n我们的测试表明，一些严重依赖触发词的社区模型在使用 ELLA 时可能会出现显著的风格损失，主要原因是 ELLA 推理过程中完全未使用 CLIP。\n\n尽管在训练过程中并未使用 CLIP，但我们发现仍可在推理阶段将 ELLA 的输入与 CLIP 的输出拼接起来（Bx77x768 + Bx64x768 -> Bx141x768），作为 UNet 的条件输入。我们预计，将 ELLA 与 CLIP 结合使用，能够更好地融入现有的社区生态体系，尤其是与 CLIP 特有的技术，如文本反转和触发词等。\n我们的目标是确保 ELLA 能够与更广泛的社区模型兼容；然而，目前我们尚无全面的经验可供分享。如果您有任何建议，请在 issue 中告知我们，我们将不胜感激。\n\n### 4. FlanT5 必须以 fp16 模式运行。\n\n如 [issues#23](https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FELLA\u002Fissues\u002F23) 所述，我们绝大多数实验都在 V100 上进行，而 V100 不支持 bf16，因此我们不得不使用 fp16 版本的 T5 进行训练。经过测试发现，fp16 版本的 T5 和 bf16 版本的 T5 在输出上的差异不容忽视，这导致生成的图像也存在明显区别。\n因此，建议在推理时使用 fp16 版本的 T5。\n\n## 📊 DPG-Bench\n\nDPG-Bench 的使用指南：\n\n1. 请根据我们的 [提示词](.\u002Fdpg_bench\u002Fprompts\u002F) 生成图像。\n    建议每个提示词生成 4 张图像，并以 2x2 的格式排列。**请确保您生成的图像文件名与提示词文件名一致。**\n\n2. 
运行以下命令进行评估。\n\n    ```bash\n    bash dpg_bench\u002Fdist_eval.sh $YOUR_IMAGE_PATH $RESOLUTION\n    ```\n\n衷心感谢 [DSG](https:\u002F\u002Fgithub.com\u002Fj-min\u002FDSG) 的出色工作，我们遵循他们的指导，生成了 DPG-Bench 的问答内容。\n\n\u003Ca id=\"emma\">\u003C\u002Fa>\n\n## 🚧 EMMA - 高效多模态适配器（正在进行中）\n\n正如 ELLA 论文的结论部分以及 [issue#15](https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FELLA\u002Fissues\u002F15) 所述，\n我们计划研究将 MLLM 与扩散模型相结合，从而在图像生成过程中利用交错的图文输入作为条件组件。以下是使用 EMMA-SD1.5 的一些非常早期的结果，敬请期待。\n\n\u003Ctable>\n\u003Cthead>\n  \u003Ctr>\n    \u003Cth>提示词\u003C\u002Fth>\n    \u003Cth>物体图像\u003C\u002Fth>\n    \u003Cth>结果\u003C\u002Fth>\n  \u003C\u002Ftr>\n\u003C\u002Fthead>\n\u003Ctbody>\n  \u003Ctr>\n    \u003Ctd>一位女士正穿着亮橙色滑雪服和护目镜，在雪山上滑雪。\u003C\u002Ftd>\n    \u003Ctd rowspan=\"3\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_8bb1cd4e18d9.jpg\" width=\"100%\">\u003C\u002Ftd>\n    \u003Ctd>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_6328f1f3137e.jpg\" width=\"100%\">\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>一位女士身穿无袖球衣，在室外球场上打篮球。\u003C\u002Ftd>\n    \u003Ctd>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_cdeb0223e5b1.jpg\" width=\"100%\">\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>一位女士穿着绿色迷彩夹克，背着背包，在茂密的森林中徒步旅行。\u003C\u002Ftd>\n    \u003Ctd>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_65a68122854e.jpg\" width=\"100%\">\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>一只狗在雪天跳过一辆汽车\u003C\u002Ftd>\n    \u003Ctd rowspan=\"2\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_273912edab25.jpg\" width=\"100%\">\u003C\u002Ftd>\n    \u003Ctd>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_429baf886ec7.jpg\" width=\"100%\">\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  
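\u003Ctr>\n    \u003Ctd>A dog wearing pink glasses is reading a book\u003C\u002Ftd>\n    \u003Ctd>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_b2b69b58379e.jpg\" width=\"100%\">\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>A dog stands on a mountaintop, overlooking a magnificent view. In the distance are snow-capped peaks, and a winding river runs through the valley.\u003C\u002Ftd>\n    \u003Ctd>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_25f97282b2de.jpg\" width=\"100%\">\u003C\u002Ftd>\n    \u003Ctd>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_readme_b70bba34a674.jpg\" width=\"100%\">\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n\n## 📝 TODO\n\n- [x] Release checkpoint\n- [x] Release inference code\n- [x] Release DPG-Bench\n\n\n## 💡 Others\n\nWe have also found [LaVi-Bridge](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.07860), an independent yet similar work completed at almost the same time, which offers additional insights not covered by ELLA. The differences between ELLA and LaVi-Bridge are discussed in [issue 13](https:\u002F\u002Fgithub.com\u002FELLA-Diffusion\u002FELLA\u002Fissues\u002F13). We warmly welcome other researchers and community users to help advance this field.\n\n## 😉 Citation\n\nIf you find **ELLA** useful for your research and applications, please cite us using this BibTeX:\n\n```\n@misc{hu2024ella,\n      title={ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment}, \n      author={Xiwei Hu and Rui Wang and Yixiao Fang and Bin Fu and Pei Cheng and Gang Yu},\n      year={2024},\n      eprint={2403.05135},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n```","# ELLA Quick Start Guide\n\nELLA (Equip Diffusion Models with LLM) is a tool that combines large language models (LLMs) with diffusion models, aiming to improve text-to-image quality through enhanced semantic alignment. This guide helps you quickly deploy and run ELLA-SD1.5.\n\n## Prerequisites\n\nBefore you begin, make sure your development environment meets the following requirements:\n\n*   **Operating system**: Linux (Ubuntu 20.04+ recommended) or macOS\n*   **Python**: 3.8 or later\n*   **GPU**: An NVIDIA GPU with CUDA support (16 GB or more of VRAM recommended; the training experiments were mostly run on V100)\n*   **Libraries**: PyTorch, Transformers, Diffusers, Gradio, etc.\n\n**Installing dependencies:**\nWe recommend using a virtual environment (such as conda or venv) and installing the base deep learning libraries. Users in mainland China can optionally point pip at a mirror such as the Tsinghua PyPI mirror to speed up installation; the commands below do not configure a mirror:\n\n```bash\npip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\npip install -r requirements.txt\n# If the project has no 
requirements.txt, install the core dependencies manually:\npip install transformers diffusers accelerate gradio safetensors\n```\n\n> **Note**: According to the official instructions, the FlanT5 model must run in `fp16` mode for best results; do not use `bf16`.\n\n## Installation\n\n### 1. Clone the repository\nFetch the official source code from GitHub:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FELLA.git\ncd ELLA\n```\n\n### 2. Download the model weights\nYou need to download the ELLA checkpoint file from Hugging Face.\n*   **Model file**: `ella-sd1.5-tsc-t5xl.safetensors`\n*   **Download page**: [QQGYLab\u002FELLA](https:\u002F\u002Fhuggingface.co\u002FQQGYLab\u002FELLA)\n\n**Faster downloads in mainland China:**\nIf connecting to Hugging Face directly is difficult, use a mirror site or a proxy:\n*   With `hf-mirror`:\n    ```bash\n    export HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com\n    huggingface-cli download --resume-download QQGYLab\u002FELLA --local-dir .\u002Fmodels\n    ```\n*   Alternatively, download the file manually, place it in the project directory, and note its absolute path.\n\n## Basic usage\n\n### Option 1: Command-line inference test\nThis is the simplest way to verify the setup; it compares the outputs of vanilla SD1.5 and ELLA-SD1.5.\n\n```bash\n# Replace \u002Fpath\u002Fto\u002Fella-sd1.5-tsc-t5xl.safetensors with the actual path to your model file\npython3 inference.py test --save_folder .\u002Fassets\u002Fella-inference-examples --ella_path \u002Fpath\u002Fto\u002Fella-sd1.5-tsc-t5xl.safetensors\n```\nAfter the run, the generated images are saved in the `.\u002Fassets\u002Fella-inference-examples` directory.\n\n### Option 2: Launch the Web UI demo\nStart a Gradio interface to enter prompts interactively and view the generated results in real time:\n\n```bash\n# Set the server address and port, then launch the demo\nGRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=8082 python3 .\u002Finference.py demo \u002Fpath\u002Fto\u002Fella-sd1.5-tsc-t5xl.safetensors\n```\nOnce it has started, open `http:\u002F\u002Flocalhost:8082` in a browser to use it.\n\n### 💡 Tip: Prompt refinement (Caption Upscale)\nELLA performs better with long, descriptive prompts. Before submitting a prompt, consider using an LLM (such as Qwen-72B or GPT-4) to expand a short prompt into a detailed description of 3 to 5 sentences covering color, shape, texture, and spatial relations.\n\n**Example instruction:**\n> \"Please generate the long prompt version of the short one... Long prompt version should consist of 3 to 5 sentences. 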
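Long prompt version must specify the color, shape, texture or spatial relation of the included objects.\"\n\nIn addition, for short prompts you can set the tokenizer's `max_length` to `None` in the code to enable the flexible token length strategy, which improves generation quality.\n\n### Option 3: ComfyUI integration\nIf you prefer ComfyUI, official and community plugins are available (including features such as ControlNet):\n*   Official plugin: [TencentQQGYLab\u002FComfyUI-ELLA](https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FComfyUI-ELLA)\n*   Community plugin: [ExponentialML\u002FComfyUI_ELLA](https:\u002F\u002Fgithub.com\u002FExponentialML\u002FComfyUI_ELLA\u002F)","An indie game developer is batch-generating NPC character portraits with complex story interactions for a fantasy RPG and needs to faithfully reproduce the subtle actions and emotions described in the script.\n\n### Without ELLA\n- **Semantic misunderstanding**: Conventional diffusion models struggle to parse long, complex sentences and often ignore compound action details such as “wiping away blood while trembling”, producing stiff poses or missing actions.\n- **Broken logical links**: When a prompt contains causal logic (e.g. “crushing a wine glass out of anger”), the model tends to draw only the angry expression or only the broken glass and fails to show the dynamic link between the two.\n- **Costly trial and error**: To fix semantic alignment problems, the developer has to split prompts by hand and try dozens of random seeds, spending a lot of time sifting for usable assets.\n- **Difficulty rendering abstract concepts**: For abstract descriptions such as “a cursed, melancholy gaze”, the model tends to produce a generic sad expression that lacks a distinctive narrative atmosphere.\n\n### With ELLA\n- **Accurate capture of deep semantics**: The LLM built into ELLA understands complex sentence structures and accurately renders fine actions such as the character “wiping away blood with trembling hands”, markedly strengthening the image's storytelling.\n- **Logical relations rendered faithfully**: The model can follow the causal chain in a prompt and vividly depict the tension of the moment when the character crushes the wine glass in anger, with emotion and action in close agreement.\n- **Much higher throughput**: The developer only needs to enter a complete passage from the script, and ELLA can produce closely matching images in one pass, cutting the tuning time per character from half an hour to a few minutes.\n- **Abstract atmosphere brought to life**: Drawing on the LLM's associative ability, ELLA can turn “cursed melancholy” into a distinctive combination of lighting and micro-expressions, directly yielding high-quality portraits that fit the world setting.\n\nBy bringing in an LLM to strengthen semantic alignment, ELLA lets text-to-image tools better understand complex creative intent, turning “gacha-style” generation into a controllable creative process.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencentQQGYLab_ELLA_2ab77356.jpg","TencentQQGYLab","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FTencentQQGYLab_b9b7d664.png",null,"https:\u002F\u002Fgithub.com\u002FTencentQQGYLab",[80,84,88],{"name":81,"color":82,"percentage":83},"Python","#3572A5",59.5,{"name":85,"color":86,"percentage":87},"Jupyter Notebook","#DA5B0B",39.8,{"name":89,"color":90,"percentage":91},"Shell","#89e051",0.6,1281,65,"2026-03-30T19:07:47","Apache-2.0","Not specified","Requires an NVIDIA GPU (experiments were run on V100); FP16 precision must be supported, BF16 is not supported",{"notes":99,"python":100,"dependencies":101},"1. The FlanT5 model must run in fp16 mode; using bf16 produces noticeably different images. 2. We recommend using an LLM (such as Qwen-72B or GPT-4) to expand input prompts (Caption Upscale) for best results. 3. For short prompts, set the tokenizer's max_length to None to enable the flexible token length strategy. 4. 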
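Some community models that rely on trigger words may need to be used together with CLIP to avoid losing their style.","3.x (inferred from the python3 command)",[102,103,104],"FlanT5 (must run in fp16 mode)","Gradio (for the demo)","ComfyUI (optional plugin support)",[14,37],"2026-03-27T02:49:30.150509","2026-04-06T09:52:02.460896",[109,114,119,124,129,134],{"id":110,"question_zh":111,"answer_zh":112,"source_url":113},10234,"When will the ELLA model weights and inference code be open-sourced? Will the SDXL version be released?","ELLA-SD1.5 has been released and is available to try. However, the maintainers have stated explicitly that the SDXL version will not be released (\"SDXL version will not be released\"). Community users have repeatedly asked them to reconsider open-sourcing the SDXL weights in light of newer models (such as Flux and HiDream), but so far the maintainers have not changed this decision. Users are advised to use the released SD1.5 version for research or applications.","https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FELLA\u002Fissues\u002F16",{"id":115,"question_zh":116,"answer_zh":117,"source_url":118},10235,"Why is the image quality of ELLA worse than that of the original model? How can it be improved?","Although ELLA improves text understanding, image quality can drop when handling complex prompts. Possible remedies: 1. Prompt refinement: break complex descriptions into more specific, structured sentences (for example, state character positions and clothing details explicitly); 2. The community suggests training only a LoRA instead of fully fine-tuning the U-Net and using it with ELLA in a way similar to LCM LoRA, which may help balance text controllability and image quality.","https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FELLA\u002Fissues\u002F20",{"id":120,"question_zh":121,"answer_zh":122,"source_url":123},10236,"When reproducing ELLA on SDXL, how should pooled_prompt_embedding be handled, and how can image oversaturation caused by overfitting be resolved?","In the SDXL architecture, pooled_prompt_embedding comes only from the second CLIP text encoder (size 1*1280). Regarding overfitting and oversaturated images during training, the maintainer's experience is that building on the diffusers SDXL training script generally works, and long-prompt following improves. If convergence is difficult, pay attention to the regularization strategy so that the model is not over-trained. Community practice so far suggests that severe oversaturation can be avoided as long as the embedding sizes are aligned correctly and the standard training procedure is followed.","https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FELLA\u002Fissues\u002F39",{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},10237,"The paper says AdaLN is used, but the code appears to use AdaLN-Zero. What is going on, and what is the initialization strategy?","This clarifies a naming and implementation detail. The code does implement the AdaLN-Zero mechanism. \"Zero\" here means that the contribution of the residual path is initialized to 0; it does not mean that every weight is initialized as a zero vector. Concretely, the linear parameters of AdaLN are initialized to 0 so that AdaLN is initially unaffected by the modulation features, and their influence then grows gradually. For example code, see the diffusers library: `nn.init.zeros_(self.linear.weight)` and `nn.init.zeros_(self.linear.bias)`. This initialization makes a DiT block approximate the identity function at the start.","https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FELLA\u002Fissues\u002F51",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},10238,"How does the training data affect model quality? How well does fine-tuning SD-1.5 directly 
perform?","The maintainers ran a comparison experiment: with the same training dataset and hyperparameters (140,000 training steps, about one epoch), they fine-tuned ELLA-SD1.5 with T5-XL and TSC, and separately the full SD v1.5 U-Net. The results show that although directly fine-tuning SD-1.5 also benefits from the data, the ELLA architecture performs better on certain benchmarks (such as T2I-CompBench). The detailed performance comparison charts are shown in the issue and support the effectiveness of introducing the additional text encoder architecture.","https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FELLA\u002Fissues\u002F10",{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},10239,"How can I obtain the ELLA model checkpoint files?","The ELLA-SD1.5 checkpoint has been officially released. The download link can be found on the project's Release page or in the related commits. The early checkpoint file was about 134 MB. See the latest announcements on the project homepage for the most direct download address and usage instructions.","https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FELLA\u002Fissues\u002F3",[]]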