[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-ML-GSAI--LLaDA":3,"tool-ML-GSAI--LLaDA":61},[4,19,28,37,45,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":18},10095,"AutoGPT","Significant-Gravitas\u002FAutoGPT","AutoGPT 是一个旨在让每个人都能轻松使用和构建 AI 的强大平台，核心功能是帮助用户创建、部署和管理能够自动执行复杂任务的连续型 AI 智能体。它解决了传统 AI 应用中需要频繁人工干预、难以自动化长流程工作的痛点，让用户只需设定目标，AI 即可自主规划步骤、调用工具并持续运行直至完成任务。\n\n无论是开发者、研究人员，还是希望提升工作效率的普通用户，都能从 AutoGPT 中受益。开发者可利用其低代码界面快速定制专属智能体；研究人员能基于开源架构探索多智能体协作机制；而非技术背景用户也可直接选用预置的智能体模板，立即投入实际工作场景。\n\nAutoGPT 的技术亮点在于其模块化“积木式”工作流设计——用户通过连接功能块即可构建复杂逻辑，每个块负责单一动作，灵活且易于调试。同时，平台支持本地自托管与云端部署两种模式，兼顾数据隐私与使用便捷性。配合完善的文档和一键安装脚本，即使是初次接触的用户也能在几分钟内启动自己的第一个 AI 智能体。AutoGPT 正致力于降低 AI 应用门槛，让人人都能成为 AI 的创造者与受益者。",183572,3,"2026-04-20T04:47:55",[13,14,15,16,17],"Agent","语言模型","插件","开发框架","图像","ready",{"id":20,"name":21,"github_repo":22,"description_zh":23,"stars":24,"difficulty_score":25,"last_commit_at":26,"category_tags":27,"status":18},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",161147,2,"2026-04-19T23:31:47",[16,13,14],{"id":29,"name":30,"github_repo":31,"description_zh":32,"stars":33,"difficulty_score":34,"last_commit_at":35,"category_tags":36,"status":18},10072,"DeepSeek-V3","deepseek-ai\u002FDeepSeek-V3","DeepSeek-V3 是一款由深度求索推出的开源混合专家（MoE）大语言模型，旨在以极高的效率提供媲美顶尖闭源模型的智能服务。它拥有 6710 亿总参数，但在处理每个 token 时仅激活 370 亿参数，这种设计巧妙解决了大规模模型推理成本高、速度慢的难题，让高性能 AI 
更易于部署和应用。\n\n这款模型特别适合开发者、研究人员以及需要构建复杂 AI 应用的企业团队使用。无论是进行代码生成、逻辑推理还是多轮对话开发，DeepSeek-V3 都能提供强大的支持。其独特之处在于采用了无辅助损失的负载均衡策略和多令牌预测训练目标，前者在提升计算效率的同时避免了性能损耗，后者则显著增强了模型表现并加速了推理过程。此外，模型在 14.8 万亿高质量令牌上完成预训练，且整个训练过程异常稳定，未出现不可恢复的损失尖峰。凭借仅需 278.8 万 H800 GPU 小时即可完成训练的高效特性，DeepSeek-V3 为开源社区树立了一个兼顾性能与成本效益的新标杆。",102693,5,"2026-04-20T03:58:04",[14],{"id":38,"name":39,"github_repo":40,"description_zh":41,"stars":42,"difficulty_score":10,"last_commit_at":43,"category_tags":44,"status":18},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[14,17,13,16],{"id":46,"name":47,"github_repo":48,"description_zh":49,"stars":50,"difficulty_score":25,"last_commit_at":51,"category_tags":52,"status":18},8553,"spec-kit","github\u002Fspec-kit","Spec Kit 是一款专为提升软件开发效率而设计的开源工具包，旨在帮助团队快速落地“规格驱动开发”（Spec-Driven Development）模式。传统开发中，需求文档往往与代码实现脱节，导致沟通成本高且结果不可控；而 Spec Kit 通过将规格说明书转化为可执行的指令，让 AI 直接依据明确的业务场景生成高质量代码，从而减少从零开始的随意编码，确保产出结果的可预测性。\n\n该工具特别适合希望利用 AI 辅助编程的开发者、技术负责人及初创团队。无论是启动全新项目还是在现有工程中引入规范化流程，用户只需通过简单的命令行操作，即可初始化项目并集成主流的 AI 编程助手。其核心技术亮点在于“规格即代码”的理念，支持社区扩展与预设模板，允许用户根据特定技术栈定制开发流程。此外，Spec Kit 强调官方维护的安全性，提供稳定的版本管理，帮助开发者在享受 AI 红利的同时，依然牢牢掌握架构设计的主动权，真正实现从“凭感觉写代码”到“按规格建系统”的转变。",88749,"2026-04-17T09:48:14",[14,17,13,16],{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":25,"last_commit_at":59,"category_tags":60,"status":18},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 
助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[16,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":76,"owner_website":76,"owner_url":77,"languages":78,"stars":100,"forks":101,"last_commit_at":102,"license":103,"difficulty_score":10,"env_os":104,"env_gpu":105,"env_ram":104,"env_deps":106,"category_tags":112,"github_topics":76,"view_count":25,"oss_zip_url":76,"oss_zip_packed_at":76,"status":18,"created_at":114,"updated_at":115,"faqs":116,"releases":145},10089,"ML-GSAI\u002FLLaDA","LLaDA","Official PyTorch implementation for \"Large Language Diffusion Models\"","LLaDA 是一款基于扩散模型架构的大型语言模型，旨在探索生成式 AI 的新路径。与传统依赖自回归机制（逐字预测）的模型不同，LLaDA 采用“掩码扩散”技术，通过迭代去噪的方式生成文本。这种独特的机制有效解决了传统模型在长文本生成中容易出现的错误累积问题，并显著提升了内容生成的多样性与可控性。\n\n作为首个从零训练且规模达 80 亿参数的扩散语言模型，LLaDA 在性能上已能与主流的 LLaMA3 8B 模型相媲美。其技术亮点不仅在于基础架构的创新，还延伸至多模态领域（如 LLaDA-V）以及高效的混合专家架构（MoE 版本），后者能在推理时仅激活少量参数即可实现超越稠密模型的效果。\n\nLLaDA 非常适合 AI 研究人员、算法工程师及大模型开发者使用。对于希望深入探究非自回归生成机制、进行前沿模型对比实验，或寻求在特定场景下获得更高生成多样性的技术团队而言，这是一个极具价值的开源项目。普通用户也可通过集成的 Gradio 演示界面，直观体验这一新型对话模型的交互能力。","# Large Language Diffusion 
Models\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-arXiv-red.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.09992)\n[![deploy](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHugging%20Face-LLaDA_Base-FFEB3B)](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Base)\n[![deploy](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHugging%20Face-LLaDA_Instruct-FFEB3B)](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Instruct)\n[![deploy](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHugging%20Face-Demo-blue)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmultimodalart\u002FLLaDA)\n[![deploy](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FZhihu1-知乎1-blue)](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F24214732238)\n[![deploy](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FZhihu2-知乎2-blue)](https:\u002F\u002Fwww.zhihu.com\u002Fquestion\u002F1908479621466396378\u002Fanswer\u002F1910672718174589774?share_code=1kreOq5gzOtnM&utm_psn=1910708245535912148&utm_source=wechat_timeline&utm_medium=social&s_r=0)\n\n## News\n### New works\n- [2025.02.14] We have uploaded LLaDA paper to [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.09992) and open-sourced [LLaDA-8B-Base](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Base) and [LLaDA-8B-Instruct](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Instruct).\n\n- [2025.05.23] We introduce [LLaDA-V](https:\u002F\u002Fml-gsai.github.io\u002FLLaDA-V-demo\u002F), a competitive diffusion-based vision-language model, outperforming other diffusion MLLMs.\n\n- [2025.05.25] We introduce [LLaDA 1.5](https:\u002F\u002Fml-gsai.github.io\u002FLLaDA-1.5-Demo\u002F), which incorporates VRPO to reduce gradient variance and enhance preference alignment in LLaDA.\n\n- [2025.09.11] We introduce [LLaDA-MoE-7B-A1B-Base](https:\u002F\u002Fhuggingface.co\u002FinclusionAI\u002FLLaDA-MoE-7B-A1B-Base) and 
[LLaDA-MoE-7B-A1B-Instruct](https:\u002F\u002Fhuggingface.co\u002FinclusionAI\u002FLLaDA-MoE-7B-A1B-Instruct), the first diffusion language model pretrained from scratch with MoE architecture. LLaDA-MoE-7B-A1B-Instruct uses only ~1B active parameters at inference while surpassing LLaDA 1.5(an 8B dense model), and comparable to Qwen2.5-3B-Instruct.\n\n\n### New features in this repo\n- [2025.05.04] We have provided evaluation code based on the [lm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) for the LLaDA-8B-Base.\n\n- [2025.10.27] We have provided batch inference support, along with all evaluation code for [LLaDA-8B-Base](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Base), [LLaDA-8B-Instruct](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Instruct) and [LLaDA 1.5](https:\u002F\u002Fml-gsai.github.io\u002FLLaDA-1.5-Demo\u002F).\n\n  \n## Introduction\nWe introduce LLaDA (\u003Cb>L\u003C\u002Fb>arge \u003Cb>La\u003C\u002Fb>nguage \u003Cb>D\u003C\u002Fb>iffusion with m\u003Cb>A\u003C\u002Fb>sking), a diffusion model with an unprecedented 8B scale, trained entirely from scratch, \nrivaling LLaMA3 8B in performance.\n\n\u003Cdiv style=\"display: flex; justify-content: center; flex-wrap: wrap;\">\n    \u003Cimg src=\".\u002Fimgs\u002FLLaDA_vs_LLaMA.svg\" style=\"width: 45%\" \u002F>\n    \u003Cimg src=\".\u002Fimgs\u002FLLaDA_vs_LLaMA_chat.svg\" style=\"width: 46%\" \u002F>\n\u003C\u002Fdiv>\n\n\n## Inference\nThe [LLaDA-8B-Base](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Base) and [LLaDA-8B-Instruct](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Instruct) are uploaded\nin Huggingface. 
Please first install `transformers==4.38.2` and use [transformers](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Findex) to load them.\n\n```python\nimport torch\nfrom transformers import AutoModel, AutoTokenizer\n\ntokenizer = AutoTokenizer.from_pretrained('GSAI-ML\u002FLLaDA-8B-Base', trust_remote_code=True)\nmodel = AutoModel.from_pretrained('GSAI-ML\u002FLLaDA-8B-Base', trust_remote_code=True, torch_dtype=torch.bfloat16)\n```\n\nWe provide `get_log_likelihood()` and `generate()` functions in `get_log_likelihood.py` \nand `generate.py` respectively, for conditional likelihood evaluation and conditional generation.\n\nYou can directly run `python chat.py` to have multi-round conversations with LLaDA-8B-Instruct.\n\n\n## Gradio demo \nThank you very much to [apolinário](https:\u002F\u002Fgithub.com\u002Fapolinario) for helping us create this amazing demo!\n\nFirst, install [Gradio](https:\u002F\u002Fwww.gradio.app) `pip install gradio`, and then you can directly run `python app.py`\n\n\u003Cdiv style=\"display: flex; justify-content: center; flex-wrap: wrap;\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FML-GSAI_LLaDA_readme_1578591293b3.gif\" style=\"width: 80%\" \u002F>\n\u003C\u002Fdiv>\n\n## Pre-training and Supervised Fine-Tuning\n\nWe will not provide the training framework and data as most open-source LLMs do.\n\nHowever, the pre-training and Supervised Fine-Tuning of LLaDA are straightforward. If \nyou have a codebase for training an autoregressive model, you can modify it to \nadapt to LLaDA with just a few lines of code.\n\nWe provide guidelines for the pre-training and SFT of LLaDA in [GUIDELINES.md](GUIDELINES.md). 
\nYou can also refer to [SMDM](https:\u002F\u002Fgithub.com\u002FML-GSAI\u002FSMDM), which has a similar training process to LLaDA \nand has open-sourced the training framework.\n\n## Evaluation\nPlease refer to [EVAL.md](EVAL.md) for instructions on using the evaluation code.\n\n## FAQ\nHere, we address some common questions about LLaDA.\n\n### 0. How do I train my own LLaDA?\nPlease refer to [GUIDELINES.md](GUIDELINES.md) for the guidelines. \nYou can also refer to [SMDM](https:\u002F\u002Fgithub.com\u002FML-GSAI\u002FSMDM), which follows the same training \nprocess as LLaDA and has open-sourced its code.\n\n\n### 1. What is the difference between LLaDA and BERT?\n\nOur motivation is not to improve BERT, nor to apply image generation methods like [MaskGIT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.04200) \nto text. **Our goal is to explore a theoretically complete language modeling approach — masked diffusion models.** \nDuring this process, we simplified the approach and discovered that the loss function of masked diffusion models \nis related to the loss functions of BERT and MaskGIT. You can find our theoretical research process in Question 7.\n\nSpecifically, LLaDA employs a masking ratio that varies randomly between 0 and 1, while BERT uses \na fixed ratio. This subtle difference has significant implications. **The training\nobjective of LLaDA is an upper bound on the negative log-likelihood of the model \ndistribution, making LLaDA a generative model.** This enables LLaDA to naturally \nperform in-context learning, instruction-following, and ensures Fisher consistency \nfor scalability with large datasets and models. You can also find a direct answer \nto this question in Section 2.1 of our paper.\n\n\n### 2. What is the relationship between LLaDA and Transformer?\nNetwork structure and probabilistic modeling are two distinct approaches that collectively form the \nfoundation of language models. LLaDA, like GPT, adopts the \nTransformer architecture. 
The key difference lies in the probabilistic modeling approach: GPT \nutilizes an autoregressive next-token prediction method, \nwhile LLaDA employs a diffusion model for probabilistic modeling.\n\n\n### 3. What is the sampling efficiency of LLaDA?\nCurrently, LLaDA's sampling speed is slower than the autoregressive baseline for three reasons: \n1. LLaDA samples with a fixed context length;\n2. LLaDA cannot yet leverage techniques like KV-Cache;\n3. LLaDA achieves optimal performance when the number of sampling steps equals the response length.\nReducing the number of sampling steps leads to a decrease in performance, as detailed in Appendix B.4 \nand Appendix B.6 of our paper.\n\nIn this work, we aim to explore the upper limits of LLaDA's capabilities, **challenging the assumption \nthat the key LLM abilities are inherently tied to autoregressive models**. We will continue \nto optimize its efficiency in the future. We believe this research approach is reasonable, \nas verifying the upper limits of diffusion language models' capabilities will provide us with\nmore resources and sufficient motivation to optimize efficiency.\n\nRecall the development of diffusion models for images, from [DDPM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.11239) \nto the [Consistency model](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.11081), where sampling speed accelerated nearly \n1000 times over the course of 4 years. **We believe there is significant room for optimization in LLaDA's \nsampling efficiency as well**. Current solutions, including [block diffusion](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.09573), can mitigate the fixed context length issue, and \n[consistency distillation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.05415) can reduce the number of sampling steps. 
In\naddition, some cache methods (e.g., [Fast-dllm](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FFast-dLLM), [dllm-cache](https:\u002F\u002Fgithub.com\u002Fmaomaocun\u002FdLLM-cache))\ncan also be adapted to LLaDA.\n\n\n### 4. What is the training stability of LLaDA?\nFor details on the pre-training process of LLaDA, please refer to Section 2.2 of our paper. \nOver the full pre-training run on 2.3T tokens, we encountered a training crash (loss becoming NaN) \nonly once, at 1.2T tokens. Our solution was to resume from a checkpoint and reduce \nthe learning rate from 4e-4 to 1e-4.\n\n\n### 5. Why is the final answer \"72\" generated earlier than the intermediate calculation step (e.g., 12 × 4 = 48) in Tab4?\n\n**The mask predictor has successfully predicted the reasoning process. However, during the \nremasking process, the reasoning steps are masked out again.** As shown in the figure \nbelow, the non-white background represents the model's generation process, while the \nwhite-background boxes indicate the predictions made by the mask predictor at each step. \nWe adopt a random remasking strategy.\n\n\u003Cdiv style=\"display: flex; justify-content: center; flex-wrap: wrap;\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FML-GSAI_LLaDA_readme_4bcea1fc370a.gif\" style=\"width: 80%\" \u002F>\n\u003C\u002Fdiv>\n\n### 6. Why does LLaDA answer 'Bailing' when asked 'Who are you'?\nThis is because our pre-training and SFT data were designed for training an autoregressive model, \nwhereas LLaDA directly utilizes data that contains identity markers.\n\n\n### 7. Our journey in developing LLaDA?\nLLaDA is built upon our two prior works, [RADD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.03736) and \n[SMDM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18514). 
\n\nRADD demonstrated that the **training objective of LLaDA serves as an upper bound on the negative \nlog-likelihood** of the model’s distribution, a conclusion also supported by [MD4](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.04329) \nand [MDLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07524). \nFurthermore, RADD was the first to theoretically prove that **masked diffusion models do not require time t \nas an input to Transformer**. This insight provides the theoretical \njustification for LLaDA’s unmodified use of the Transformer architecture. Lastly, \nRADD showed that **the training objective of masked diffusion models is equivalent to that of \nany-order autoregressive models**, offering valuable insights into how masked diffusion models can \novercome the reversal curse.\n\nSMDM introduces the first **scaling law** for masked diffusion models and demonstrates that, with the \nsame model size and training data, masked diffusion models can achieve downstream benchmark results \non par with those of autoregressive models. 
Additionally, SMDM presents a simple, **unsupervised \nclassifier-free guidance** method that greatly improves downstream benchmark performance, which has \nbeen adopted by LLaDA.\n\n\n## Citation\n\n```bibtex\n@article{nie2025large,\n  title={Large Language Diffusion Models},\n  author={Nie, Shen and Zhu, Fengqi and You, Zebin and Zhang, Xiaolu and Ou, Jingyang and Hu, Jun and Zhou, Jun and Lin, Yankai and Wen, Ji-Rong and Li, Chongxuan},\n  journal={arXiv preprint arXiv:2502.09992},\n  year={2025}\n}\n```\n","# 大型语言扩散模型\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-arXiv-red.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.09992)\n[![部署](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHugging%20Face-LLaDA_Base-FFEB3B)](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Base)\n[![部署](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHugging%20Face-LLaDA_Instruct-FFEB3B)](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Instruct)\n[![部署](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHugging%20Face-Demo-blue)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmultimodalart\u002FLLaDA)\n[![部署](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FZhihu1-知乎1-blue)](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F24214732238)\n[![部署](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FZhihu2-知乎2-blue)](https:\u002F\u002Fwww.zhihu.com\u002Fquestion\u002F1908479621466396378\u002Fanswer\u002F1910672718174589774?share_code=1kreOq5gzOtnM&utm_psn=1910708245535912148&utm_source=wechat_timeline&utm_medium=social&s_r=0)\n\n## 新闻\n### 新成果\n- [2025.02.14] 我们已将 LLaDA 论文上传至 [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.09992)，并开源了 [LLaDA-8B-Base](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Base) 和 [LLaDA-8B-Instruct](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Instruct)。\n\n- [2025.05.23] 我们推出了 
[LLaDA-V](https:\u002F\u002Fml-gsai.github.io\u002FLLaDA-V-demo\u002F)，这是一款具有竞争力的基于扩散的视觉-语言模型，性能优于其他扩散型多模态大语言模型。\n\n- [2025.05.25] 我们推出了 [LLaDA 1.5](https:\u002F\u002Fml-gsai.github.io\u002FLLaDA-1.5-Demo\u002F)，该模型结合了 VRPO 方法以降低梯度方差，并进一步提升偏好对齐效果。\n\n- [2025.09.11] 我们推出了 [LLaDA-MoE-7B-A1B-Base](https:\u002F\u002Fhuggingface.co\u002FinclusionAI\u002FLLaDA-MoE-7B-A1B-Base) 和 [LLaDA-MoE-7B-A1B-Instruct](https:\u002F\u002Fhuggingface.co\u002FinclusionAI\u002FLLaDA-MoE-7B-A1B-Instruct)，这是首个从头开始预训练的采用 MoE 架构的扩散语言模型。LLaDA-MoE-7B-A1B-Instruct 在推理时仅使用约 10 亿活跃参数，却超越了 LLaDA 1.5（一个 80 亿参数的密集模型），且性能与 Qwen2.5-3B-Instruct 相当。\n\n\n### 本仓库新增功能\n- [2025.05.04] 我们基于 [lm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) 提供了针对 LLaDA-8B-Base 的评估代码。\n\n- [2025.10.27] 我们增加了批量推理支持，并提供了适用于 [LLaDA-8B-Base](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Base)、[LLaDA-8B-Instruct](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Instruct) 和 [LLaDA 1.5](https:\u002F\u002Fml-gsai.github.io\u002FLLaDA-1.5-Demo\u002F) 的完整评估代码。\n\n\n## 简介\n我们推出 LLaDA（\u003Cb>L\u003C\u002Fb>arge \u003Cb>La\u003C\u002Fb>nguage \u003Cb>D\u003C\u002Fb>iffusion with m\u003Cb>A\u003C\u002Fb>sking），这是一种规模空前的 80 亿参数扩散模型，完全从零开始训练，其性能可与 LLaMA3 8B 相媲美。\n\n\u003Cdiv style=\"display: flex; justify-content: center; flex-wrap: wrap;\">\n    \u003Cimg src=\".\u002Fimgs\u002FLLaDA_vs_LLaMA.svg\" style=\"width: 45%\" \u002F>\n    \u003Cimg src=\".\u002Fimgs\u002FLLaDA_vs_LLaMA_chat.svg\" style=\"width: 46%\" \u002F>\n\u003C\u002Fdiv>\n\n\n## 推理\n[LLaDA-8B-Base](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Base) 和 [LLaDA-8B-Instruct](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Instruct) 已上传至 Huggingface。请先安装 `transformers==4.38.2`，并使用 [transformers](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Findex) 库进行加载。\n\n```angular2html\nfrom transformers import AutoModel, AutoTokenizer\n\ntokenizer = 
AutoTokenizer.from_pretrained('GSAI-ML\u002FLLaDA-8B-Base', trust_remote_code=True)\nmodel = AutoModel.from_pretrained('GSAI-ML\u002FLLaDA-8B-Base', trust_remote_code=True, torch_dtype=torch.bfloat16)\n```\n\n我们在 `get_log_likelihood.py` 和 `generate.py` 中分别提供了 `get_log_likelihood()` 和 `generate()` 函数，用于条件似然评估和条件生成。\n\n您也可以直接运行 `python chat.py`，与 LLaDA-8B-Instruct 进行多轮对话。\n\n\n## Gradio 演示\n非常感谢 [apolinário](https:\u002F\u002Fgithub.com\u002Fapolinario) 帮助我们打造了这个精彩的演示！\n\n首先，请安装 [Gradio](https:\u002F\u002Fwww.gradio.app)：`pip install gradio`，然后即可直接运行 `python app.py`。\n\n\u003Cdiv style=\"display: flex; justify-content: center; flex-wrap: wrap;\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FML-GSAI_LLaDA_readme_1578591293b3.gif\" style=\"width: 80%\" \u002F>\n\u003C\u002Fdiv>\n\n## 预训练与监督微调\n\n我们不会像大多数开源大模型那样公开训练框架和数据。\n\n然而，LLaDA 的预训练和监督微调过程非常简单。如果您已有用于训练自回归模型的代码库，只需添加几行代码，即可轻松适配 LLaDA。\n\n我们在 [GUIDELINES.md](GUIDELINES.md) 中提供了 LLaDA 预训练和 SFT 的指导说明。您也可以参考 [SMDM](https:\u002F\u002Fgithub.com\u002FML-GSAI\u002FSMDM)，它采用了与 LLaDA 类似的训练流程，并已开源了训练框架。\n\n## 评估\n有关如何使用评估代码的说明，请参阅 [EVAL.md](EVAL.md)。\n\n## 常见问题解答\n在此，我们解答一些关于 LLaDA 的常见问题。\n\n### 0. 如何训练自己的 LLaDA？\n请参阅 [GUIDELINES.md](GUIDELINES.md) 获取相关指导。您也可以参考 [SMDM](https:\u002F\u002Fgithub.com\u002FML-GSAI\u002FSMDM)，它遵循与 LLaDA 相同的训练流程，并已开源代码。\n\n\n### 1. LLaDA 与 BERT 有何不同？\n\n我们的初衷并非改进 BERT，也无意将类似 [MaskGIT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.04200) 的图像生成方法应用于文本。**我们的目标是探索一种理论上完备的语言建模方法——掩码扩散模型。** 在这一过程中，我们简化了方法，并发现掩码扩散模型的损失函数与 BERT 和 MaskGIT 的损失函数存在关联。您可以在第 7 问中找到我们的理论研究过程。\n\n具体而言，LLaDA 使用随机变化于 0 到 1 之间的掩码比例，而 BERT 则采用固定比例。这一细微差异具有重大意义。**LLaDA 的训练目标是模型分布负对数似然的上界，因此 LLaDA 是一种生成式模型。** 这使得 LLaDA 能够自然地进行上下文学习、指令跟随，并确保在大规模数据集和模型上的 Fisher 一致性，从而实现良好的可扩展性。您也可以在我们论文的 2.1 节中找到对此问题的直接解答。\n\n\n### 2. LLaDA 与 Transformer 有何关系？\n网络结构和概率建模是构成语言模型基础的两种不同方法。LLaDA 和 GPT 一样，都采用了 Transformer 架构。关键区别在于概率建模方式：GPT 使用自回归的下一个词预测方法，而 LLaDA 则采用扩散模型进行概率建模。\n\n### 3. 
LLaDA 的采样效率如何？\n目前，LLaDA 的采样速度比自回归基线模型慢，主要原因有以下三点：\n1. LLaDA 使用固定的上下文长度进行采样；\n2. LLaDA 尚无法利用 KV 缓存等技术；\n3. LLaDA 只有在采样步数等于生成响应长度时才能达到最佳性能。\n减少采样步数会导致性能下降，具体细节请参见我们论文的附录 B.4 和附录 B.6。\n\n在本工作中，我们旨在探索 LLaDA 能力的上限，**挑战“关键的 LLM 能力本质上与自回归模型相关”这一假设**。未来我们将继续优化其效率。我们认为这种研究方法是合理的，因为验证扩散语言模型能力的上限，将为我们提供更多的资源和充分的动力来提升效率。\n\n回顾图像扩散模型的发展历程，从 [DDPM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.11239) 到 [一致性模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.11081)，在短短四年内采样速度提升了近 1000 倍。**我们相信 LLaDA 的采样效率同样存在巨大的优化空间**。当前的一些解决方案，例如 [块扩散](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.09573)，可以缓解固定上下文长度的问题；而 [一致性蒸馏](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.05415) 则能减少采样步数。此外，一些缓存方法（如 [Fast-dllm](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FFast-dLLM)、[dllm-cache](https:\u002F\u002Fgithub.com\u002Fmaomaocun\u002FdLLM-cache)）也可以被 LLaDA 所借鉴。\n\n### 4. LLaDA 的训练稳定性如何？\n关于 LLaDA 预训练过程的详细信息，请参阅我们论文的第 2.2 节。在整个 2.3T 令牌的预训练过程中，我们仅在 1.2T 令牌处遇到一次训练崩溃（损失变为 NaN）。我们的解决方法是恢复检查点，并将学习率从 4e-4 降低至 1e-4。\n\n### 5. 为什么在 Tab4 中，最终答案“72”会比中间计算步骤（例如 12 × 4 = 48）更早生成呢？\n\n**掩码预测器已经成功预测了推理过程，但在重新掩码的过程中，推理步骤又被再次遮蔽了。** 如下图所示，非白色背景代表模型的生成过程，而白色背景的方框则表示掩码预测器在每一步做出的预测。我们采用了随机重新掩码的策略。\n\n\u003Cdiv style=\"display: flex; justify-content: center; flex-wrap: wrap;\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FML-GSAI_LLaDA_readme_4bcea1fc370a.gif\" style=\"width: 80%\" \u002F>\n\u003C\u002Fdiv>\n\n### 6. 为什么当被问及“你是谁”时，LLaDA 会回答“Bailing”？\n这是因为我们的预训练数据和 SFT 数据原本是为自回归模型设计的，而 LLaDA 直接使用了包含身份标识的数据。\n\n### 7. 
我们开发 LLaDA 的历程是怎样的？\nLLaDA 是基于我们先前的两项工作 [RADD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.03736) 和 [SMDM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18514) 构建的。\n\nRADD 证明了 **LLaDA 的训练目标是模型分布负对数似然的上界**，这一结论也得到了 [MD4](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.04329) 和 [MDLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07524) 的支持。此外，RADD 还首次从理论上证明了 **掩码扩散模型无需将时间 t 作为 Transformer 的输入**。这一发现为 LLaDA 直接沿用原始 Transformer 架构提供了理论依据。最后，RADD 还表明 **掩码扩散模型的训练目标等价于任意阶自回归模型的训练目标**，这为掩码扩散模型克服反转诅咒提供了重要启示。\n\nSMDM 则首次提出了掩码扩散模型的 **缩放定律**，并证明在相同模型规模和训练数据条件下，掩码扩散模型能够达到与自回归模型相当的下游基准测试结果。此外，SMDM 还提出了一种简单且**无监督的无分类器指导**方法，显著提升了下游基准测试性能，该方法已被 LLaDA 采用。\n\n## 引用\n```bibtex\n@article{nie2025large,\n  title={Large Language Diffusion Models},\n  author={Nie, Shen and Zhu, Fengqi and You, Zebin and Zhang, Xiaolu and Ou, Jingyang and Hu, Jun and Zhou, Jun and Lin, Yankai and Wen, Ji-Rong and Li, Chongxuan},\n  journal={arXiv preprint arXiv:2502.09992},\n  year={2025}\n}\n```","# LLaDA 快速上手指南\n\nLLaDA (Large Language Diffusion with mAsk) 是一个基于扩散模型架构的大语言模型，规模达 8B，从头训练而成，性能可媲美 LLaMA3 8B。本指南将帮助您快速部署并运行 LLaDA。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**: Linux 或 macOS (Windows 需使用 WSL2)\n*   **Python**: 推荐 Python 3.9 或更高版本\n*   **GPU**: 推荐使用支持 CUDA 的 NVIDIA GPU (显存建议 16GB 以上以运行 8B 模型)\n*   **PyTorch**: 需安装与 CUDA 版本匹配的 PyTorch\n\n**前置依赖安装：**\n本项目严格依赖特定版本的 `transformers` 库。\n\n```bash\npip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\npip install transformers==4.38.2\n```\n\n> **提示**：国内用户可使用清华或阿里镜像源加速安装：\n> `pip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118`\n> `pip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple transformers==4.38.2`\n\n若需运行 Gradio 演示界面，还需安装：\n```bash\npip install gradio\n```\n\n## 安装步骤\n\nLLaDA 无需复杂的源码编译，主要通过 Hugging Face 直接加载模型。请确保网络能访问 Hugging 
Face，或配置好镜像代理。\n\n1.  **创建项目目录并进入**\n    ```bash\n    mkdir llada-demo && cd llada-demo\n    ```\n\n2.  **验证环境**\n    运行以下 Python 代码测试 `transformers` 版本及模型加载能力（首次运行会自动下载模型权重）：\n    ```python\n    from transformers import AutoModel, AutoTokenizer\n    import torch\n\n    # 加载分词器和模型 (以 Base 版本为例)\n    tokenizer = AutoTokenizer.from_pretrained('GSAI-ML\u002FLLaDA-8B-Base', trust_remote_code=True)\n    model = AutoModel.from_pretrained('GSAI-ML\u002FLLaDA-8B-Base', trust_remote_code=True, torch_dtype=torch.bfloat16)\n    \n    print(\"模型加载成功！\")\n    ```\n\n## 基本使用\n\n仓库提供了多种使用方式，包括脚本调用、对话交互和 Web 演示。\n\n### 1. 命令行多轮对话 (推荐)\n直接使用仓库提供的 `chat.py` 脚本与 **LLaDA-8B-Instruct** 进行交互式对话。\n\n```bash\npython chat.py\n```\n*运行后直接在终端输入问题即可，支持多轮上下文记忆。*\n\n### 2. 启动 Gradio Web 界面\n如果您更喜欢图形化界面，可以启动本地 Web Demo。\n\n```bash\npython app.py\n```\n*运行后终端会显示本地访问地址（通常为 `http:\u002F\u002F127.0.0.1:7860`），在浏览器打开即可体验。*\n\n### 3. 代码调用 (生成与评估)\n您可以在自己的 Python 脚本中调用核心功能进行条件生成或对数似然评估。\n\n```python\n# 示例：导入生成函数\n# 假设已按照上述“环境准备”加载了 model 和 tokenizer\nfrom generate import generate\n\n# 执行生成逻辑 (具体参数请参考 generate.py 内部实现)\n# response = generate(model, tokenizer, prompt=\"你的提示词\")\n```\n\n> **注意**：由于扩散模型的特性，LLaDA 的采样速度目前可能慢于自回归模型（如 LLaMA），且最佳性能通常需要采样步数等于响应长度。","某初创团队正在开发一款需要高并发、低延迟响应的智能客服系统，旨在处理海量用户咨询并生成自然流畅的回复。\n\n### 没有 LLaDA 时\n- **推理成本高企**：传统自回归模型在生成长文本时需逐词计算，导致显存占用大、推理速度慢，难以支撑高并发场景。\n- **长上下文控制弱**：面对复杂的多轮对话历史，模型容易丢失关键信息或产生逻辑断层，回复质量不稳定。\n- **架构升级困难**：若要提升性能通常需堆叠更大参数量的稠密模型，硬件投入成倍增加，且无法灵活调整激活参数量。\n- **生成多样性受限**：确定性采样策略使得回复模式固定，缺乏拟人化的变化，用户体验显得机械呆板。\n\n### 使用 LLaDA 后\n- **推理效率飞跃**：利用扩散机制并行生成特性，LLaDA 显著缩短了首字延迟和总生成时间，轻松应对流量高峰。\n- **全局一致性增强**：基于掩码的扩散过程让模型能同时关注全文上下文，确保长对话中的逻辑连贯与事实准确。\n- **稀疏架构降本**：引入 LLaDA-MoE 版本后，仅激活约 10 亿参数即可超越原有 8B 稠密模型表现，大幅降低部署成本。\n- **回复自然生动**：扩散模型固有的随机性赋予了回复更多样化的表达方式，使交互过程更接近真人沟通。\n\nLLaDA 通过扩散架构与稀疏化设计的结合，在保持顶尖性能的同时打破了传统语言模型的效率瓶颈，为大规模落地应用提供了全新范式。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FML-GSAI_LLaDA_15785912.gif","ML-GSAI","ML Group @ 
RUC","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FML-GSAI_0f126283.jpg","Chongxuan Li's research group @ Renmin University of China",null,"https:\u002F\u002Fgithub.com\u002FML-GSAI",[79,83,87,91,94,97],{"name":80,"color":81,"percentage":82},"Python","#3572A5",99.8,{"name":84,"color":85,"percentage":86},"Shell","#89e051",0.1,{"name":88,"color":89,"percentage":90},"CSS","#663399",0,{"name":92,"color":93,"percentage":90},"Makefile","#427819",{"name":95,"color":96,"percentage":90},"JavaScript","#f1e05a",{"name":98,"color":99,"percentage":90},"HTML","#e34c26",3729,258,"2026-04-17T18:52:25","MIT","未说明","需要支持 bfloat16 的 NVIDIA GPU（根据代码 torch_dtype=torch.bfloat16 推断），显存需求未明确说明（模型为 8B 参数，通常建议 24GB+ 以流畅运行），CUDA 版本未说明",{"notes":107,"python":104,"dependencies":108},"1. 必须安装特定版本的 transformers (4.38.2) 并使用 trust_remote_code=True 加载模型。\n2. 模型加载时需指定数据类型为 torch.bfloat16。\n3. 提供基础版 (Base) 和指令微调版 (Instruct) 两种 8B 模型，另有 MoE 架构版本可用。\n4. 推理速度相比自回归模型较慢，因为需要固定上下文长度且多步采样。\n5. 训练框架和数据集未开源，但提供了基于现有自回归代码修改的指导方针。",[109,110,111],"transformers==4.38.2","torch (需支持 bfloat16)","gradio (仅用于 Demo)",[14,113],"其他","2026-03-27T02:49:30.150509","2026-04-20T19:22:56.888248",[117,122,127,132,137,141],{"id":118,"question_zh":119,"answer_zh":120,"source_url":121},45303,"如何在训练 LLaDA 模型时启用梯度检查点（Gradient Checkpointing）以节省显存？","可以通过向模型类注入自定义方法来实现。一种方法是使用 Hugging Face Trainer 兼容的方式，注册新的 API：\n\n```python\nimport importlib\ndef enable_llada_gradient_checkpointing(model):\n    cfg_modname = model.config.__class__.__module__\n    mdl_modname = model.__class__.__module__\n    cfg_mod = importlib.import_module(cfg_modname)\n    mdl_mod = importlib.import_module(mdl_modname)\n\n    ActivationCheckpointingStrategy = cfg_mod.ActivationCheckpointingStrategy\n\n    def gc_enable(self, gradient_checkpointing_kwargs=None):\n        self.config.use_cache = False\n        # 选择所需的 LLaDA 策略\n        self.model.set_activation_checkpointing(ActivationCheckpointingStrategy.whole_layer)\n\n    
setattr(mdl_mod.LLaDAModelLM, \"supports_gradient_checkpointing\", True)\n    setattr(mdl_mod.LLaDAModelLM, \"gradient_checkpointing_enable\", gc_enable)\n```\n\n另一种通用方法是包装 Transformer 块：\n```python\nimport torch.nn as nn\nimport torch.utils.checkpoint as checkpoint\nclass CheckpointedWrapper(nn.Module):\n    def __init__(self, block):\n        super().__init__()\n        self.block = block\n    def forward(self, *args, **kwargs):\n        def custom_forward(*inputs):\n            return self.block(*inputs, **kwargs)\n        return checkpoint.checkpoint(custom_forward, *args)\n\n# 对每个 block 应用包装\nfor i, block in enumerate(model.model.transformer.blocks):\n    model.model.transformer.blocks[i] = CheckpointedWrapper(block)\n```","https:\u002F\u002Fgithub.com\u002FML-GSAI\u002FLLaDA\u002Fissues\u002F73",{"id":123,"question_zh":124,"answer_zh":125,"source_url":126},45304,"LLaDA 与 BERT 等掩码语言模型（Masked Language Models）有什么本质区别？","LLaDA 基于掩码扩散模型（MDM）理论框架。虽然它与 BERT 类似（通过随机掩码部分 token 并预测它们进行训练），但关键区别在于：\n1. **生成能力**：BERT 主要用于语言理解，掩码比例通常不超过 15%；而 LLaDA 在训练时均匀选择 0% 到 100% 的掩码比例，使其具备生成能力。\n2. **训练目标**：MDM 的训练涉及连续时间变量，其损失函数是交叉熵项的时间相关积分，形成 ELBO（证据下界），支持最大似然估计。\n3. 
**采样过程**：MDM 通过时间离散化进行采样，而不是像传统自回归模型那样逐个 token 解码。\n不过，近期研究指出，如果均匀选择掩码数量并对所有掩码位置平均计算交叉熵损失，MDM 的 ELBO 可转化为普通掩码模型的 ELBO，且逐 token 解码可能更高效稳定。","https:\u002F\u002Fgithub.com\u002FML-GSAI\u002FLLaDA\u002Fissues\u002F10",{"id":128,"question_zh":129,"answer_zh":130,"source_url":131},45305,"是否计划将 LLaDA 模型发布到 Hugging Face Hub 以便下载和使用？","是的，作者已响应社区请求，将 LLaDA-8B-Base 和 LLaDA-8B-Instruct 模型检查点发布到了 Hugging Face Hub。用户现在可以通过 `GSAI-ML\u002FLLaDA-8B-Base` 和 `GSAI-ML\u002FLLaDA-8B-Instruct` 仓库直接下载模型。此外，模型卡片（Model Cards）也已添加，方便用户了解模型详情并通过标签筛选发现模型。","https:\u002F\u002Fgithub.com\u002FML-GSAI\u002FLLaDA\u002Fissues\u002F3",{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},45306,"如何使用 lm-eval-harness 评估 LLaDA 的条件生成（conditional generation）能力？","目前官方代码库中尚未包含 `generate_until` 的实现，因此直接使用 lm-eval 评估条件生成能力受限。社区用户正在尝试复现结果并提交 PR。对于 LLaDA-Base，已有用户成功复现；对于 LLaDA-Instruct，难点在于如何在少样本测试中将聊天模板（chat template）正确添加到示例中。建议关注项目后续的 PR 更新，或参考论文附录中的设置自行实现评估逻辑。若成功复现，欢迎向作者提交代码或发送邮件交流。","https:\u002F\u002Fgithub.com\u002FML-GSAI\u002FLLaDA\u002Fissues\u002F42",{"id":138,"question_zh":139,"answer_zh":140,"source_url":136},45307,"在使用“随机重掩码”（Randomly remasking）策略时，为什么会出现数值溢出或结果异常？","这是因为在计算重掩码置信度时，初始实现在 logits 上添加了 Gumbel 噪声。理论上，temperature=0 对应最低置信度重掩码，temperature=infinity 对应随机重掩码。但在测试随机重掩码时，若将 temperature 设置为极大的值，会导致计算出的置信度值发生数值溢出，最终产生的结果类似于从左到右的自回归生成，而非预期的随机掩码行为。修复方法是避免使用过大的 temperature 值，或调整噪声添加逻辑以确保数值稳定性。",{"id":142,"question_zh":143,"answer_zh":144,"source_url":126},45308,"Inception Labs 的 Mercury 模型为何能实现极高的推理速度，LLaDA 能否达到类似效果？","目前 Inception Labs 未在技术报告中披露 Mercury 模型的具体加速细节。考虑到双向注意力机制在 KV Cache 方面的局限性，Mercury 可能采用了与学术界常见的扩散语言模型（如 LLaDA）完全不同的技术手段。因此，暂时无法确定 LLaDA 是否能直接复用其加速方案，需要等待更多技术细节公开或社区进一步探索针对双向注意力的优化策略。",[]]