[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-yule-BUAA--MergeLM":3,"tool-yule-BUAA--MergeLM":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":81,"owner_email":82,"owner_twitter":83,"owner_website":84,"owner_url":85,"languages":86,"stars":91,"forks":92,"last_commit_at":93,"license":82,"difficulty_score":10,"env_os":94,"env_gpu":95,"env_ram":94,"env_deps":96,"category_tags":106,"github_topics":82,"view_count":23,"oss_zip_url":82,"oss_zip_packed_at":82,"status":16,"created_at":107,"updated_at":108,"faqs":109,"releases":149},3677,"yule-BUAA\u002FMergeLM","MergeLM","Codebase for Merging Language Models (ICML 2024)","MergeLM 是一个专注于大语言模型合并的开源项目，源自 ICML 2024 获奖论文《Language Models are Super Mario》。它核心解决了一个痛点：如何让多个经过不同任务微调的同源模型“取长补短”，融合成一个能力更全面的新模型，而无需昂贵的重新训练过程或大量 GPU 资源。\n\n该项目提出了一种名为 DARE（Drop And REscale）的创新技术。研究发现，模型微调后产生的参数变化中，绝大部分其实是冗余的。DARE 能够安全地将这些增量参数中的 90% 甚至 99% 直接置零而不影响模型原有能力。基于此，MergeLM 先将多个模型的参数进行稀疏化处理，再通过简单的参数平均即可实现高效合并。这就好比超级马里奥通过触碰道具直接获得新能力一样，让模型以“免费午餐”的方式吸收同源模型的特长。\n\nMergeLM 特别适合 AI 研究人员、大模型开发者以及希望优化模型性能的工程师使用。无论是想要整合不同领域知识的团队，还是希望在有限算力下探索模型潜力的高级用户，都能利用它轻松将多个 7B 或更大规模的模型合成为性能更强的单一模型，甚至曾助力合并模型在开源榜单上","MergeLM 是一个专注于大语言模型合并的开源项目，源自 ICML 2024 获奖论文《Language Models are Super Mario》。它核心解决了一个痛点：如何让多个经过不同任务微调的同源模型“取长补短”，融合成一个能力更全面的新模型，而无需昂贵的重新训练过程或大量 GPU 资源。\n\n该项目提出了一种名为 DARE（Drop And 
REscale）的创新技术。研究发现，模型微调后产生的参数变化中，绝大部分其实是冗余的。DARE 能够安全地将这些增量参数中的 90% 甚至 99% 直接置零而不影响模型原有能力。基于此，MergeLM 先将多个模型的参数进行稀疏化处理，再通过简单的参数平均即可实现高效合并。这就好比超级马里奥通过触碰道具直接获得新能力一样，让模型以“免费午餐”的方式吸收同源模型的特长。\n\nMergeLM 特别适合 AI 研究人员、大模型开发者以及希望优化模型性能的工程师使用。无论是想要整合不同领域知识的团队，还是希望在有限算力下探索模型潜力的高级用户，都能利用它轻松将多个 7B 或更大规模的模型合成为性能更强的单一模型，甚至曾助力合并模型在开源榜单上斩获榜首。","# Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch\n\n\u003Cdiv  align=\"center\">  \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyule-BUAA_MergeLM_readme_43d88db2bfad.jpeg\" width=\"25%\"> \n\u003C\u002Fdiv>\n\nThis repository is built for the paper [Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.03099).\n🔔 If you have any questions or suggestions, please feel free to let us know. \nYou can directly email [Le Yu](https:\u002F\u002Fyule-buaa.github.io\u002F) using the email address yule@buaa.edu.cn or post an issue on this repository.\n\n\n## 💥 News 💥\n\n- 🔥🔥🔥[**May 2, 2024**] Our paper is accepted at ICML 2024! The camera ready version is coming soon.\n- 🔥🔥🔥[**February 9, 2024**] Special thanks to [Sourab Mangrulkar](https:\u002F\u002Fgithub.com\u002Fpacman100) for integrating our work into the [huggingface\u002Fpeft Project](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpeft)!\n- 🔥🔥🔥[**January 28, 2024**] Our merged model [supermario_v2](https:\u002F\u002Fhuggingface.co\u002FvanillaOVO\u002Fsupermario_v2) ranks first among 7B models on the Open LLM Leaderboard!   
\n- 🔥🔥🔥[**December 4, 2023**] We appreciate [Minhajul Hoque](https:\u002F\u002Fmedium.com\u002F@minh.hoque) for sharing our work on [Medium](https:\u002F\u002Fmedium.com\u002F@minh.hoque\u002Fpaper-explained-language-models-are-super-mario-2ebce6c2cf35)!\n- 🔥🔥🔥[**November 29, 2023**] Special thanks to [papersread.ai](https:\u002F\u002Fpapersread.ai\u002F) for sharing [our work](https:\u002F\u002Fpapersread.ai\u002Fe\u002Flanguage-models-are-super-mario-absorbing-abilities-from-homologous-models-as-a-free-lunch\u002F)!\n- 🔥🔥🔥[**November 29, 2023**] We appreciate [martyn](https:\u002F\u002Fgithub.com\u002Fmartyn) for extending our work to [Stable Diffusion models](https:\u002F\u002Fgithub.com\u002Fmartyn\u002Fsafetensors-merge-supermario)!\n- 🔥🔥🔥[**November 27, 2023**] Special thanks to [brucethemoose](https:\u002F\u002Fhuggingface.co\u002Fbrucethemoose) for applying our work on the [model](https:\u002F\u002Fhuggingface.co\u002Fbrucethemoose\u002FCapyTessBorosYi-34B-200K-DARE-Ties) on Hugging Face!\n- 🔥🔥🔥[**November 26, 2023**] We appreciate [cg123](https:\u002F\u002Fgithub.com\u002Fcg123) for integrating our work into the [mergekit Project](https:\u002F\u002Fgithub.com\u002Farcee-ai\u002Fmergekit)!\n- 🔥🔥🔥[**November 25, 2023**] Special thanks to [fly51fly](https:\u002F\u002Ftwitter.com\u002Ffly51fly) for sharing our work on [Twitter](https:\u002F\u002Ftwitter.com\u002Ffly51fly\u002Fstatus\u002F1728159826742755588)!\n- 🔥🔥🔥[**November 24, 2023**] We appreciate [uukuguy](https:\u002F\u002Fgithub.com\u002Fuukuguy) for integrating our work into the [Multi-LoRAs Project](https:\u002F\u002Fpypi.org\u002Fproject\u002Fmulti-loras\u002F0.2.0)!\n- 🔥🔥🔥[**November 23, 2023**] Special thanks to [WizardLM](https:\u002F\u002Ftwitter.com\u002FWizardLM_AI) for sharing our work on [Twitter](https:\u002F\u002Ftwitter.com\u002FWizardLM_AI\u002Fstatus\u002F1727672799391842468)!\n- 🔥🔥🔥[**November 21, 2023**] We appreciate [PaperWeekly](http:\u002F\u002Fwww.paperweekly.info) for sharing 
our work on [WeChat](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002FYiqWovBUXIbzmUbL6uT-8g) and [Zhihu](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F668152236)! \n- 🔥🔥🔥[**November 11, 2023**] Special thanks to [夕小瑶](https:\u002F\u002Fxixiaoyao.github.io\u002Fabout\u002F) for sharing our work on [WeChat](https:\u002F\u002Fmp.weixin.qq.com\u002Fs?__biz=MzIwNzc2NTk0NQ%3D%3D&mid=2247565881&idx=2&sn=57985427fdb6751d617df801ca7fd810) and [Zhihu](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F666363702)!\n- 🔥🔥🔥[**November 6, 2023**] Our paper is available on [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.03099), [Papers With Code](https:\u002F\u002Fpaperswithcode.com\u002Fpaper\u002Flanguage-models-are-super-mario-absorbing), and [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2311.03099).\n\n\n## Overview\n\nIn this work, we uncover that Language Models (LMs), either encoder- or decoder-based, can **obtain new capabilities by assimilating the parameters of homologous models without the need for retraining or GPUs**. \n1. We introduce a novel operation called **DARE** to directly set most of (90% or even 99%) the delta parameters to zeros without affecting the capabilities of SFT LMs. \n2. We sparsify delta parameters of multiple SFT homologous models with DARE as a **general preprocessing technique** and subsequently merge them into a single model by parameter averaging.\n\nThe workflow is shown as follows,\n\u003Cdiv  align=\"center\">  \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyule-BUAA_MergeLM_readme_6348d58110b7.jpg\" width=\"80%\"> \n\u003C\u002Fdiv>\n\nBy conducting extensive experiments, we find that: \n1. DARE is effective for SFT models whose delta parameter value ranges are relatively small (e.g., within 0.005), being able to eliminate even 99\\% delta parameters. 
Larger models can tolerate a higher proportion of discarded parameters, indicating that SFT naturally learns an extremely sparse set of delta parameters, and nearly all abilities originate from the pre-trained LMs. See (a) in the figure below. \n2. DARE can merge multiple task-specific LMs into a single LM that retains the abilities of all the SFT models. For instance, merging WizardLM with WizardMath raises the GSM8K accuracy of WizardLM from 2.2 to 66.3, maintaining its instruction-following capabilities while surpassing WizardMath's original score of 64.2. See (b) in the figure below.\n\u003Cdiv  align=\"center\">  \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyule-BUAA_MergeLM_readme_e5e3ee6187be.jpg\" width=\"80%\"> \n\u003C\u002Fdiv>\n\n\n## Language Models and Datasets \n\nWe conduct experiments on both encoder- and decoder-based LMs.\n* For encoder-based LMs, we choose bert-base-uncased and roberta-base as pre-trained backbones. Eight datasets from the GLUE benchmark are used: CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, and RTE.\n* For decoder-based LMs, we choose LLaMA, Llama 2, and Code Llama as pre-trained backbones. WizardLM, WizardMath, WizardCoder-Python, and Code Alpaca are used as fine-tuned models. \nWe evaluate three tasks on five datasets: AlpacaEval (instruction following), GSM8K and MATH (mathematical reasoning), and HumanEval and MBPP (code generation).\n\nNote that we provide the GSM8K, MATH, and MBPP datasets in the ```math_code_data\u002F``` folder; they are obtained from the [WizardLM repository](https:\u002F\u002Fgithub.com\u002Fnlpxucan\u002FWizardLM). \nOther datasets are downloaded automatically by our code. Language models can be downloaded either manually or by our code.   
\n\nYou can also modify the ```cache_dir``` in the ```utils\u002Fload_config.py``` file to specify where datasets and models are saved.\n\n\n## Model Merging Methods\n\nWe provide implementations of five model merging methods in this repository: \n[Average Merging](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.05482), \n[Task Arithmetic](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.04089), \n[Fisher Merging](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.09832), \n[RegMean](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.09849), and \n[TIES-Merging](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01708). \nWe also combine the proposed [DARE](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.03099) with the above methods to improve merging performance.\n\n\n## Environments\n\n[PyTorch 2.0.1](https:\u002F\u002Fpytorch.org\u002F),\n[transformers 4.33.1](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Findex),\n[datasets 2.13.1](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Findex),\n[vllm 0.1.4](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm),\n[human_eval](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fhuman-eval),\n[numpy](https:\u002F\u002Fgithub.com\u002Fnumpy\u002Fnumpy), and\n[tqdm](https:\u002F\u002Fgithub.com\u002Ftqdm\u002Ftqdm).\n\n\n## Executing Scripts for Encoder-based LMs\nFor encoder-based LMs, we first fine-tune them on the GLUE benchmark (both single-task and multi-task settings are supported), \nand then run inference with them. We also provide scripts to merge encoder-based LMs with five model merging methods. 
\n\n### Scripts for Fine-Tuning on GLUE\n* Example of fine-tuning *roberta-base* on *CoLA* dataset under single-task setting:\n```{bash}\npython train_plms_glue.py --language_model_name roberta-base --dataset_name cola --learning_rate 1e-5 --num_runs 5\n```\n* Example of fine-tuning *roberta-base* on *CoLA* and *RTE* datasets under multi-task setting:\n```{bash}\npython train_plms_glue.py --language_model_name roberta-base --dataset_name cola --multitask_training --auxiliary_dataset_name rte --learning_rate 1e-5 --num_runs 5\n```\n\n### Scripts for Inference with DARE and Other Variants\n* Example of direct inference on *roberta-base* (drop rate 0.0):\n```{bash}\npython inference_plms_glue.py --language_model_name roberta-base --weight_mask_rate 0.0\n```\n* Example of inference on *roberta-base* with DARE (drop rate 0.9):\n```{bash}\npython inference_plms_glue.py --language_model_name roberta-base --weight_mask_rate 0.9 --use_weight_rescale\n```\n* Example of inference on *roberta-base* with DropOnly (drop rate 0.9):\n```{bash}\npython inference_plms_glue.py --language_model_name roberta-base --weight_mask_rate 0.9\n```\n* Example of inference on *roberta-base* with magnitude-based pruning (drop rate 0.9):\n```{bash}\npython inference_plms_glue.py --language_model_name roberta-base --weight_mask_rate 0.9 --mask_strategy magnitude\n```\n* Example of inference on *roberta-base* with masking fine-tuned parameters (drop rate 0.9):\n```{bash}\npython inference_plms_glue.py --language_model_name roberta-base --weight_mask_rate 0.9 --use_weight_rescale --weight_format finetuned_weight\n```\n\n### Scripts for Merging Models\n* Example of merging pairwise fine-tuned *roberta-base* with Average Merging:\n```{bash}\npython merge_plms_glue.py --merging_method_name average_merging --language_model_name roberta-base\n```\n* Example of merging pairwise fine-tuned *roberta-base* with Fisher Merging:\n```{bash}\npython merge_plms_glue.py --merging_method_name fisher_merging 
--normalize_fisher_weight --language_model_name roberta-base\n```\n* Example of merging pairwise fine-tuned *roberta-base* with Average Merging and DARE:\n```{bash}\npython merge_plms_glue.py --merging_method_name mask_merging --use_weight_rescale --language_model_name roberta-base --mask_apply_method average_merging\n```\n\n\n## Executing Scripts for Decoder-based LMs\nSince the decoder-based LMs we use have already been fine-tuned, they can be directly utilized for inference.\nWe also provide scripts to merge decoder-based LMs with two model merging methods (Average Merging and Task Arithmetic).\n\n### Scripts for Inference with DARE and Other Variants\n* Example of direct inference on *WizardMath-7B-V1.0* on *GSM8K* (drop rate 0.0):\n```{bash}\npython inference_llms_instruct_math_code.py --dataset_name gsm8k --finetuned_model_name WizardMath-7B-V1.0 --tensor_parallel_size 1 --weight_mask_rate 0.0\n```\n* Example of inference on *WizardMath-7B-V1.0* on *GSM8K* with DARE (drop rate 0.9):\n```{bash}\npython inference_llms_instruct_math_code.py --dataset_name gsm8k --finetuned_model_name WizardMath-7B-V1.0 --tensor_parallel_size 1 --weight_mask_rate 0.9 --use_weight_rescale\n```\n* Example of inference on *WizardMath-7B-V1.0* on *GSM8K* with DropOnly (drop rate 0.9):\n```{bash}\npython inference_llms_instruct_math_code.py --dataset_name gsm8k --finetuned_model_name WizardMath-7B-V1.0 --tensor_parallel_size 1 --weight_mask_rate 0.9\n```\n* Example of inference on *WizardMath-7B-V1.0* on *GSM8K* with magnitude-based pruning (drop rate 0.9):\n```{bash}\npython inference_llms_instruct_math_code.py --dataset_name gsm8k --finetuned_model_name WizardMath-7B-V1.0 --tensor_parallel_size 1 --weight_mask_rate 0.9 --mask_strategy magnitude\n```\n* Example of inference on *WizardMath-7B-V1.0* on *GSM8K* with masking fine-tuned parameters (drop rate 0.9):\n```{bash}\npython inference_llms_instruct_math_code.py --dataset_name gsm8k --finetuned_model_name WizardMath-7B-V1.0 
--tensor_parallel_size 1 --weight_mask_rate 0.9 --use_weight_rescale --weight_format finetuned_weight\n```\n\n### Scripts for Merging Models\n* Example of merging *WizardLM-13B-V1.2* and *WizardMath-13B-V1.0* with Average Merging:\n```{bash}\npython merge_llms_instruct_math_code.py --merge_instruct --merge_math --merging_method_name average_merging --tensor_parallel_size 1\n```\n* Example of merging *WizardLM-13B-V1.2* and *WizardMath-13B-V1.0* with Task Arithmetic:\n```{bash}\npython merge_llms_instruct_math_code.py --merge_instruct --merge_math --merging_method_name task_arithmetic --scaling_coefficient 1.0 --tensor_parallel_size 1\n```\n* Example of merging *WizardLM-13B-V1.2* and *WizardMath-13B-V1.0* with Average Merging and DARE (drop rate 0.2):\n```{bash}\npython merge_llms_instruct_math_code.py --merge_instruct --merge_math --merging_method_name mask_merging --use_weight_rescale --weight_mask_rate 0.2 --mask_apply_method average_merging --tensor_parallel_size 1\n```\n\n❗**Note 1**: When merging decoder-based LMs, the number of GPUs to allocate equals num_models_to_merge * tensor_parallel_size.\nFor example, to merge *WizardLM-13B-V1.2* and *WizardMath-13B-V1.0* with tensor_parallel_size == 1, allocate 2 * 1 = 2 GPUs.\n\n❗**Note 2**: If vllm raises the error \"AssertionError: data parallel group is already initialized\" on your device, try running ```direct_inference_merged_llms_instruct_math_code.py``` with the corresponding settings.\nFor example, if this error occurs when merging *WizardLM-13B-V1.2* and *WizardMath-13B-V1.0* with Average Merging and DARE (drop rate 0.2), run the following commands to evaluate on the instruction-following or math task:\n```{bash}\npython direct_inference_merged_llms_instruct_math_code.py --merge_instruct --merge_math --merging_method_name mask_merging --use_weight_rescale --weight_mask_rate 0.2 --mask_apply_method average_merging --tensor_parallel_size 1 --evaluate_task 
instruct\npython direct_inference_merged_llms_instruct_math_code.py --merge_instruct --merge_math --merging_method_name mask_merging --use_weight_rescale --weight_mask_rate 0.2 --mask_apply_method average_merging --tensor_parallel_size 1 --evaluate_task math\n```\n\n### Evaluation Process for AlpacaEval, HumanEval and MBPP\nFor AlpacaEval, HumanEval and MBPP, our code stores the generated files; run the following evaluation commands separately to get the final metrics.\n\n* For AlpacaEval:\nWe use ```chatgpt_fn``` from the [alpaca_eval repository](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval) to compute the win rate. First, follow the [alpaca_eval repository](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval) to install the environment.\nThen, to evaluate the generated *WizardLM-13B-V1.2_inference_mask_0.2_rescale_True.json* file, run\n```{bash}\nalpaca_eval --model_outputs .\u002Fsave_gen_instruct_responses_results\u002Falpaca_eval\u002FWizardLM-13B-V1.2_inference_mask_0.2_rescale_True.json --annotators_config chatgpt_fn --name WizardLM-13B-V1.2_inference_mask_0.2_rescale_True\n```\n\n* For HumanEval:\nFirst, follow the [human-eval repository](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fhuman-eval) to install the environment.\nThen, to evaluate the generated *WizardCoder-Python-13B-V1.0_inference_mask_0.2_rescale_True.jsonl* file, run\n```{bash}\nevaluate_functional_correctness .\u002Fsave_gen_codes_results\u002Fhuman_eval\u002FWizardCoder-Python-13B-V1.0_inference_mask_0.2_rescale_True.jsonl\n```\n\n* For MBPP:\nFirst, follow the [bigcode-evaluation-harness repository](https:\u002F\u002Fgithub.com\u002Fbigcode-project\u002Fbigcode-evaluation-harness) to install the environment.\nThen, to evaluate the generated *WizardCoder-Python-13B-V1.0_inference_mask_0.2_rescale_True.jsonl* file, run\n```{bash}\naccelerate launch .\u002Fbigcode-evaluation-harness\u002Fmain.py 
--tasks mbpp --allow_code_execution --load_generations_path .\u002Fsave_gen_codes_results\u002Fmbpp\u002FWizardCoder-Python-13B-V1.0_inference_mask_0.2_rescale_True.jsonl\n```\n\n\n## Acknowledgments\n\nWe are grateful to the authors of [WizardLM](https:\u002F\u002Fgithub.com\u002Fnlpxucan\u002FWizardLM) for making their project codes publicly available.\n\n\n## Citation\n\nPlease consider citing our paper when using this project.\n```{bibtex}\n@inproceedings{yu2024language,\n  title={Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch},\n  author={Yu, Le and Yu, Bowen and Yu, Haiyang and Huang, Fei and Li, Yongbin},\n  booktitle={International Conference on Machine Learning},\n  year={2024},\n  organization={PMLR}\n}\n```\n\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyule-BUAA_MergeLM_readme_e3743e44d079.png)](https:\u002F\u002Fstar-history.com\u002F#yule-BUAA\u002FMergeLM&Timeline)\n","# 语言模型就是超级马里奥：从同源模型中免费吸收能力\n\n\u003Cdiv  align=\"center\">  \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyule-BUAA_MergeLM_readme_43d88db2bfad.jpeg\" width=\"25%\"> \n\u003C\u002Fdiv>\n\n本仓库用于支持论文《语言模型就是超级马里奥：从同源模型中免费吸收能力》（[arXiv:2311.03099](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.03099)）。\n\n🔔 如果您有任何问题或建议，请随时告知我们。您可以直接通过邮箱 yule@buaa.edu.cn 联系 [Le Yu](https:\u002F\u002Fyule-buaa.github.io\u002F)，或者在本仓库中提交 issue。\n\n## 💥 最新消息 💥\n\n- 🔥🔥🔥[**2024年5月2日**] 我们的论文已被 ICML 2024 接受！最终定稿版本即将发布。\n- 🔥🔥🔥[**2024年2月9日**] 特别感谢 [Sourab Mangrulkar](https:\u002F\u002Fgithub.com\u002Fpacman100)，将我们的工作集成到 [huggingface\u002Fpeft 项目](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpeft) 中！\n- 🔥🔥🔥[**2024年1月28日**] 我们的合并模型 [supermario_v2](https:\u002F\u002Fhuggingface.co\u002FvanillaOVO\u002Fsupermario_v2) 在 Open LLM Leaderboard 上的 7B 模型中排名第一！\n- 🔥🔥🔥[**2023年12月4日**] 感谢 [Minhajul Hoque](https:\u002F\u002Fmedium.com\u002F@minh.hoque) 在 
[Medium](https:\u002F\u002Fmedium.com\u002F@minh.hoque\u002Fpaper-explained-language-models-are-super-mario-2ebce6c2cf35) 上分享我们的工作！\n- 🔥🔥🔥[**2023年11月29日**] 特别感谢 [papersread.ai](https:\u002F\u002Fpapersread.ai\u002F) 分享了我们的工作 [链接](https:\u002F\u002Fpapersread.ai\u002Fe\u002Flanguage-models-are-super-mario-absorbing-abilities-from-homologous-models-as-a-free-lunch\u002F)！\n- 🔥🔥🔥[**2023年11月27日**] 我们感谢 [martyn](https:\u002F\u002Fgithub.com\u002Fmartyn) 将我们的工作扩展到 [Stable Diffusion 模型](https:\u002F\u002Fgithub.com\u002Fmartyn\u002Fsafetensors-merge-supermario)！\n- 🔥🔥🔥[**2023年11月26日**] 特别感谢 [brucethemoose](https:\u002F\u002Fhuggingface.co\u002Fbrucethemoose) 将我们的工作应用到 Hugging Face 上的 [模型](https:\u002F\u002Fhuggingface.co\u002Fbrucethemoose\u002FCapyTessBorosYi-34B-200K-DARE-Ties)！\n- 🔥🔥🔥[**2023年11月25日**] 我们感谢 [cg123](https:\u002F\u002Fgithub.com\u002Fcg123) 将我们的工作集成到 [mergekit 项目](https:\u002F\u002Fgithub.com\u002Farcee-ai\u002Fmergekit) 中！\n- 🔥🔥🔥[**2023年11月24日**] 特别感谢 [fly51fly](https:\u002F\u002Ftwitter.com\u002Ffly51fly) 在 [Twitter](https:\u002F\u002Ftwitter.com\u002Ffly51fly\u002Fstatus\u002F1728159826742755588) 上分享我们的工作！\n- 🔥🔥🔥[**2023年11月23日**] 特别感谢 [WizardLM](https:\u002F\u002Ftwitter.com\u002FWizardLM_AI) 在 [Twitter](https:\u002F\u002Ftwitter.com\u002FWizardLM_AI\u002Fstatus\u002F1727672799391842468) 上分享我们的工作！\n- 🔥🔥🔥[**2023年11月21日**] 我们感谢 [PaperWeekly](http:\u002F\u002Fwww.paperweekly.info) 在 [微信](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002FYiqWovBUXIbzmUbL6uT-8g) 和 [知乎](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F668152236) 上分享我们的工作！\n- 🔥🔥🔥[**2023年11月11日**] 特别感谢 [夕小瑶](https:\u002F\u002Fxixiaoyao.github.io\u002Fabout\u002F) 在 [微信](https:\u002F\u002Fmp.weixin.qq.com\u002Fs?__biz=MzIwNzc2NTk0NQ%3D%3D&mid=2247565881&idx=2&sn=57985427fdb6751d617df801ca7fd810) 和 [知乎](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F666363702) 上分享我们的工作！\n- 🔥🔥🔥[**2023年11月6日**] 我们的论文已在 [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.03099)、[Papers With 
Code](https:\u002F\u002Fpaperswithcode.com\u002Fpaper\u002Flanguage-models-are-super-mario-absorbing) 和 [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2311.03099) 上公开。\n\n\n## 概述\n\n在本工作中，我们发现无论是编码器还是解码器架构的语言模型（LM），都可以在无需重新训练或使用 GPU 的情况下，通过吸收同源模型的参数来获得新的能力。具体来说：\n1. 我们提出了一种名为 **DARE** 的新操作，可以将大部分（90% 甚至 99%）的 delta 参数直接置为零，而不会影响 SFT 模型的能力。\n2. 我们将多个 SFT 同源模型的 delta 参数通过 DARE 进行稀疏化处理，作为一种通用的预处理技术，然后通过参数平均的方式将其合并为一个单一模型。\n\n工作流程如下所示：\n\u003Cdiv  align=\"center\">  \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyule-BUAA_MergeLM_readme_6348d58110b7.jpg\" width=\"80%\"> \n\u003C\u002Fdiv>\n\n通过大量实验，我们发现：\n1. DARE 对于 delta 参数值范围相对较小的 SFT 模型（例如在 0.005 以内）非常有效，甚至可以消除高达 99% 的 delta 参数。较大的模型能够容忍更高比例的被丢弃参数，这表明 SFT 天然地学习到了一组极其稀疏的 delta 参数，几乎所有能力都源自预训练模型本身。详见下图 (a)。\n2. DARE 可以将多个特定任务的 LM 合并为一个具备多样化能力的 LM，使其同时拥有所有 SFT 模型的功能。例如，将 WizardLM 和 WizardMath 合并后，WizardLM 的 GSM8K 准确率从 2.2% 提升至 66.3%，在保持指令遵循能力的同时，还超越了 WizardMath 原有的 64.2% 表现。详见下图 (b)。\n\u003Cdiv  align=\"center\">  \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyule-BUAA_MergeLM_readme_e5e3ee6187be.jpg\" width=\"80%\"> \n\u003C\u002Fdiv>\n\n\n## 语言模型与数据集\n\n我们在编码器和解码器两种架构的 LM 上进行了实验。\n* 对于编码器架构的 LM，我们选择了 bert-base-uncased 和 roberta-base 作为预训练骨干。使用了 GLUE 基准中的八项数据集，包括 CoLA、SST-2、MRPC、STS-B、QQP、MNLI、QNLI 和 RTE。\n* 对于解码器架构的 LM，我们选择了 LLaMA、Llama 2 和 Code Llama 作为预训练骨干。Fine-tuned 模型包括 WizardLM、WizardMath、WizardCoder-Python 和 Code Alpaca。\n我们评估了五种数据集上的三项任务：AlpacaEval（指令遵循）、GSM8K 和 MATH（数学推理），以及 HumanEval 和 MBPP（代码生成）。\n\n请注意，我们在 ```math_code_data\u002F``` 文件夹中提供了 GSM8K、MATH 和 MBPP 数据集，这些数据来自 [WizardLM 仓库](https:\u002F\u002Fgithub.com\u002Fnlpxucan\u002FWizardLM)。其他数据集可以通过我们的代码自动下载。对于语言模型，您可以手动下载，也可以使用我们的代码进行下载。\n您还可以修改 ```utils\u002Fload_config.py``` 文件中的 ```cache_dir``` 参数，指定您自己的路径来保存数据集和模型。\n\n\n## 模型合并方法\n\n本仓库提供了五种模型合并方法的完整实现，包括：\n[Average Merging](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.05482)、\n[Task 
Arithmetic](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.04089)、\n[Fisher Merging](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.09832)、\n[RegMean](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.09849) 和 \n[TIES-Merging](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01708)。\n此外，我们还将提出的 [DARE](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.03099) 与上述方法结合使用，以进一步提升合并效果。\n\n## 环境\n\n[PyTorch 2.0.1](https:\u002F\u002Fpytorch.org\u002F)、\n[transformers 4.33.1](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Findex)、\n[datasets 2.13.1](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Findex)、\n[vllm 0.1.4](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm)、\n[human_eval](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fhuman-eval)、\n[numpy](https:\u002F\u002Fgithub.com\u002Fnumpy\u002Fnumpy)，以及\n[tqdm](https:\u002F\u002Fgithub.com\u002Ftqdm\u002Ftqdm)。\n\n\n## 基于编码器的语言模型执行脚本\n对于基于编码器的语言模型，我们首先在 GLUE 基准上对其进行微调（支持单任务和多任务两种设置），然后进行推理。我们还提供了使用五种模型合并方法来合并基于编码器的语言模型的脚本。\n\n### 在 GLUE 上微调的脚本\n* 在单任务设置下对 *roberta-base* 模型在 *CoLA* 数据集上进行微调的示例：\n```{bash}\npython train_plms_glue.py --language_model_name roberta-base --dataset_name cola --learning_rate 1e-5 --num_runs 5\n```\n* 在多任务设置下对 *roberta-base* 模型在 *CoLA* 和 *RTE* 数据集上进行微调的示例：\n```{bash}\npython train_plms_glue.py --language_model_name roberta-base --dataset_name cola --multitask_training --auxiliary_dataset_name rte --learning_rate 1e-5 --num_runs 5\n```\n\n### 使用 DARE 及其他变体进行推理的脚本\n* 对 *roberta-base* 模型进行直接推理（丢弃率 0.0）的示例：\n```{bash}\npython inference_plms_glue.py --language_model_name roberta-base --weight_mask_rate 0.0\n```\n* 对 *roberta-base* 模型应用 DARE 进行推理（丢弃率 0.9）的示例：\n```{bash}\npython inference_plms_glue.py --language_model_name roberta-base --weight_mask_rate 0.9 --use_weight_rescale\n```\n* 对 *roberta-base* 模型应用 DropOnly 进行推理（丢弃率 0.9）的示例：\n```{bash}\npython inference_plms_glue.py --language_model_name roberta-base --weight_mask_rate 0.9\n```\n* 对 *roberta-base* 
模型应用基于权重大小的剪枝进行推理（丢弃率 0.9）的示例：\n```{bash}\npython inference_plms_glue.py --language_model_name roberta-base --weight_mask_rate 0.9 --mask_strategy magnitude\n```\n* 对 *roberta-base* 模型应用掩码细调参数进行推理（丢弃率 0.9）的示例：\n```{bash}\npython inference_plms_glue.py --language_model_name roberta-base --weight_mask_rate 0.9 --use_weight_rescale --weight_format finetuned_weight\n```\n\n### 合并模型的脚本\n* 将成对微调过的 *roberta-base* 模型通过平均合并法进行合并的示例：\n```{bash}\npython merge_plms_glue.py --merging_method_name average_merging --language_model_name roberta-base\n```\n* 将成对微调过的 *roberta-base* 模型通过 Fisher 合并法进行合并的示例：\n```{bash}\npython merge_plms_glue.py --merging_method_name fisher_merging --normalize_fisher_weight --language_model_name roberta-base\n```\n* 将成对微调过的 *roberta-base* 模型通过平均合并法结合 DARE 进行合并的示例：\n```{bash}\npython merge_plms_glue.py --merging_method_name mask_merging --use_weight_rescale --language_model_name roberta-base --mask_apply_method average_merging\n```\n\n## 基于解码器的语言模型执行脚本\n由于我们使用的基于解码器的语言模型已经过微调，因此可以直接用于推理。我们还提供了使用两种模型合并方法（平均合并和任务算术）来合并基于解码器的语言模型的脚本。\n\n### 使用 DARE 及其他变体进行推理的脚本\n* 对 *WizardMath-7B-V1.0* 模型在 *GSM8K* 数据集上进行直接推理（丢弃率 0.0）的示例：\n```{bash}\npython inference_llms_instruct_math_code.py --dataset_name gsm8k --finetuned_model_name WizardMath-7B-V1.0 --tensor_parallel_size 1 --weight_mask_rate 0.0\n```\n* 对 *WizardMath-7B-V1.0* 模型在 *GSM8K* 数据集上应用 DARE 进行推理（丢弃率 0.9）的示例：\n```{bash}\npython inference_llms_instruct_math_code.py --dataset_name gsm8k --finetuned_model_name WizardMath-7B-V1.0 --tensor_parallel_size 1 --weight_mask_rate 0.9 --use_weight_rescale\n```\n* 对 *WizardMath-7B-V1.0* 模型在 *GSM8K* 数据集上应用 DropOnly 进行推理（丢弃率 0.9）的示例：\n```{bash}\npython inference_llms_instruct_math_code.py --dataset_name gsm8k --finetuned_model_name WizardMath-7B-V1.0 --tensor_parallel_size 1 --weight_mask_rate 0.9\n```\n* 对 *WizardMath-7B-V1.0* 模型在 *GSM8K* 数据集上应用基于权重大小的剪枝进行推理（丢弃率 0.9）的示例：\n```{bash}\npython inference_llms_instruct_math_code.py --dataset_name gsm8k --finetuned_model_name 
WizardMath-7B-V1.0 --tensor_parallel_size 1 --weight_mask_rate 0.9 --mask_strategy magnitude\n```\n* 对 *WizardMath-7B-V1.0* 模型的微调参数进行掩码并在 *GSM8K* 数据集上推理（丢弃率 0.9）的示例：\n```{bash}\npython inference_llms_instruct_math_code.py --dataset_name gsm8k --finetuned_model_name WizardMath-7B-V1.0 --tensor_parallel_size 1 --weight_mask_rate 0.9 --use_weight_rescale --weight_format finetuned_weight\n```\n\n### 合并模型的脚本\n* 将 *WizardLM-13B-V1.2* 和 *WizardMath-13B-V1.0* 模型通过平均合并法进行合并的示例：\n```{bash}\npython merge_llms_instruct_math_code.py --merge_instruct --merge_math --merging_method_name average_merging --tensor_parallel_size 1\n```\n* 将 *WizardLM-13B-V1.2* 和 *WizardMath-13B-V1.0* 模型通过任务算术法进行合并的示例：\n```{bash}\npython merge_llms_instruct_math_code.py --merge_instruct --merge_math --merging_method_name task_arithmetic --scaling_coefficient 1.0 --tensor_parallel_size 1\n```\n* 将 *WizardLM-13B-V1.2* 和 *WizardMath-13B-V1.0* 模型通过平均合并法结合 DARE（丢弃率 0.2）进行合并的示例：\n```{bash}\npython merge_llms_instruct_math_code.py --merge_instruct --merge_math --merging_method_name mask_merging --use_weight_rescale --weight_mask_rate 0.2 --mask_apply_method average_merging --tensor_parallel_size 1\n```\n\n❗**注 1**：在合并基于解码器的语言模型时，我们需要分配的 GPU 数量等于要合并的模型数量乘以张量并行度。例如，如果我们想将 *WizardLM-13B-V1.2* 和 *WizardMath-13B-V1.0* 模型以张量并行度为 1 的方式进行合并，则需要分配 2 × 1 = 2 个 GPU。\n\n❗**注 2**：如果您的设备上 vllm 抛出 “AssertionError: data parallel group is already initialized” 错误，请尝试使用相应的设置运行 ```direct_inference_merged_llms_instruct_math_code.py```。例如，如果在使用平均合并法结合 DARE（丢弃率 0.2）合并 *WizardLM-13B-V1.2* 和 *WizardMath-13B-V1.0* 时出现此错误，请运行以下命令来评估指令或数学相关任务：\n```{bash}\npython direct_inference_merged_llms_instruct_math_code.py --merge_instruct --merge_math --merging_method_name mask_merging --use_weight_rescale --weight_mask_rate 0.2 --mask_apply_method average_merging --tensor_parallel_size 1 --evaluate_task instruct\npython direct_inference_merged_llms_instruct_math_code.py --merge_instruct --merge_math --merging_method_name mask_merging 
--use_weight_rescale --weight_mask_rate 0.2 --mask_apply_method average_merging --tensor_parallel_size 1 --evaluate_task math\n```\n\n### AlpacaEval、HumanEval 和 MBPP 的评估流程\n对于 AlpacaEval、HumanEval 和 MBPP，我们的代码会保存生成的文件，请额外运行以下评估命令以获得最终的指标。\n\n* 对于 AlpacaEval：\n我们使用 [alpaca_eval 仓库](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval) 中的 ```chatgpt_fn``` 来计算胜率。首先，请参考 [alpaca_eval 仓库](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval) 安装环境。\n然后，如果您想评估生成的 *WizardLM-13B-V1.2_inference_mask_0.2_rescale_True.json* 文件，请运行：\n```{bash}\nalpaca_eval --model_outputs .\u002Fsave_gen_instruct_responses_results\u002Falpaca_eval\u002FWizardLM-13B-V1.2_inference_mask_0.2_rescale_True.json --annotators_config chatgpt_fn --name WizardLM-13B-V1.2_inference_mask_0.2_rescale_True\n```\n\n* 对于 HumanEval：\n首先，请参考 [human-eval 仓库](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fhuman-eval) 安装环境。\n然后，如果您想评估生成的 *WizardCoder-Python-13B-V1.0_inference_mask_0.2_rescale_True.jsonl* 文件，请运行：\n```{bash}\nevaluate_functional_correctness .\u002Fsave_gen_codes_results\u002Fhuman_eval\u002FWizardCoder-Python-13B-V1.0_inference_mask_0.2_rescale_True.jsonl\n```\n\n* 对于 MBPP：\n首先，请参考 [bigcode-evaluation-harness 仓库](https:\u002F\u002Fgithub.com\u002Fbigcode-project\u002Fbigcode-evaluation-harness) 安装环境。\n然后，如果您想评估生成的 *WizardCoder-Python-13B-V1.0_inference_mask_0.2_rescale_True.jsonl* 文件，请运行：\n```{bash}\naccelerate launch .\u002Fbigcode-evaluation-harness\u002Fmain.py --tasks mbpp --allow_code_execution --load_generations_path .\u002Fsave_gen_codes_results\u002Fmbpp\u002FWizardCoder-Python-13B-V1.0_inference_mask_0.2_rescale_True.jsonl\n```\n\n\n## 致谢\n\n我们感谢 [WizardLM](https:\u002F\u002Fgithub.com\u002Fnlpxucan\u002FWizardLM) 的作者们将其项目代码公开分享。\n\n\n## 引用\n在使用本项目时，请考虑引用我们的论文。\n```{bibtex}\n@inproceedings{yu2024language,\n  title={Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch},\n  author={Yu, Le and Yu, Bowen and Yu, Haiyang and Huang, Fei and Li, 
Yongbin},\n  booktitle={International Conference on Machine Learning},\n  year={2024},\n  organization={PMLR}\n}\n```\n\n\n## 星标历史\n\n[![星标历史图](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyule-BUAA_MergeLM_readme_e3743e44d079.png)](https:\u002F\u002Fstar-history.com\u002F#yule-BUAA\u002FMergeLM&Timeline)","# MergeLM 快速上手指南\n\nMergeLM 是一个基于论文《Language Models are Super Mario》的开源工具，旨在通过 **DARE**（Drop And REscale）技术，无需重新训练或额外 GPU 资源，即可将多个同源大语言模型（LLM）的能力合并到一个模型中。它支持 Encoder-based（如 BERT）和 Decoder-based（如 LLaMA、WizardLM）模型。\n\n## 环境准备\n\n### 系统要求\n- **操作系统**: Linux (推荐) 或 macOS\n- **Python**: 3.8+\n- **GPU**: 可选（推理和合并大模型时推荐），支持 CUDA\n\n### 前置依赖\n请确保已安装以下核心库（版本需严格匹配以避免兼容性问题）：\n- PyTorch == 2.0.1\n- transformers == 4.33.1\n- datasets == 2.13.1\n- vllm == 0.1.4 (用于加速解码器模型推理)\n- human_eval, numpy, tqdm\n\n> **国内加速建议**：\n> 建议使用清华或阿里镜像源加速 Python 包下载：\n> `pip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple \u003Cpackage_name>`\n\n## 安装步骤\n\n1. **克隆仓库**\n   ```bash\n   git clone https:\u002F\u002Fgithub.com\u002Fyule-BUAA\u002FMergeLM.git\n   cd MergeLM\n   ```\n\n2. **安装依赖**\n   创建虚拟环境并安装所需包（推荐使用 pip）：\n   ```bash\n   pip install torch==2.0.1 transformers==4.33.1 datasets==2.13.1 vllm==0.1.4 human-eval numpy tqdm\n   ```\n   *注：若需完整复现所有功能，可参考项目根目录下的 `requirements.txt`（如有）。*\n\n3. **配置数据与模型缓存（可选）**\n   默认情况下，数据集和模型会下载到 Hugging Face 默认缓存目录。如需自定义路径，修改 `utils\u002Fload_config.py` 中的 `cache_dir`：\n   ```python\n   # utils\u002Fload_config.py\n   cache_dir = \"\u002Fyour\u002Fcustom\u002Fpath\u002Fto\u002Fcache\"\n   ```\n\n## 基本使用\n\nMergeLM 的核心流程分为两步：**稀疏化（DARE）** 和 **模型合并**。以下提供最常用的 Decoder-based 模型（如 WizardLM\u002FWizardMath）使用示例。\n\n### 1. 
单模型推理与 DARE 测试\n在合并前，可先测试单个模型在使用 DARE 丢弃部分参数后的表现。以下示例对 `WizardMath-7B-V1.0` 在 `GSM8K` 数据集上进行推理，丢弃 90% 的增量参数并重新缩放。\n\n```bash\npython inference_llms_instruct_math_code.py \\\n    --dataset_name gsm8k \\\n    --finetuned_model_name WizardMath-7B-V1.0 \\\n    --tensor_parallel_size 1 \\\n    --weight_mask_rate 0.9 \\\n    --use_weight_rescale\n```\n*参数说明：*\n- `--weight_mask_rate 0.9`: 丢弃 90% 的 Delta 参数。\n- `--use_weight_rescale`: 启用 DARE 的重缩放机制，保持模型能力。\n\n### 2. 合并多个模型\n将具备不同能力的模型（如指令遵循能力的 `WizardLM` 和数学能力的 `WizardMath`）合并为一个全能模型。\n\n**示例：使用平均合并法 (Average Merging) + DARE**\n```bash\npython merge_llms_instruct_math_code.py \\\n    --merge_instruct \\\n    --merge_math \\\n    --merging_method_name mask_merging \\\n    --mask_apply_method average_merging \\\n    --use_weight_rescale \\\n    --tensor_parallel_size 1\n```\n*说明：*\n- `--merge_instruct` 和 `--merge_math`: 指定要合并的模型类型（需在代码或配置中对应具体模型路径）。\n- `mask_merging`: 表示结合 DARE 进行合并。\n- 合并后的模型权重通常会保存或在内存中直接用于后续推理。\n\n### 3. Encoder-based 模型示例 (BERT\u002FRoBERTa)\n如果您使用的是 BERT 等编码器模型，可先微调再合并。以下为合并两个微调过的 `roberta-base` 模型的命令：\n\n```bash\npython merge_plms_glue.py \\\n    --merging_method_name mask_merging \\\n    --use_weight_rescale \\\n    --language_model_name roberta-base \\\n    --mask_apply_method average_merging\n```\n\n---\n**提示**：更多高级用法（如 Task Arithmetic、Fisher Merging 等）及详细参数请参考仓库中的 `README.md` 或脚本帮助信息 (`python script_name.py --help`)。","某初创团队希望快速构建一个兼具医疗问答、法律条文解读和代码生成能力的多功能助手，但受限于算力预算无法对大模型进行全量微调。\n\n### 没有 MergeLM 时\n- **训练成本高昂**：为了让模型掌握三个领域的知识，团队需分别收集数据并进行三次独立的全量微调或 LoRA 训练，消耗大量 GPU 机时。\n- **显存资源瓶颈**：同时加载多个领域模型或进行大规模参数合并时，单卡显存往往不足，迫使团队升级昂贵的硬件设施。\n- **部署维护复杂**：最终需要维护三个独立的模型文件，用户在不同任务间切换时需频繁重载模型，导致服务延迟高且架构臃肿。\n- **能力相互干扰**：若尝试将不同领域的数据混合在一起重新训练，容易出现“灾难性遗忘”，导致模型在某一领域表现下降。\n\n### 使用 MergeLM 后\n- **零成本融合能力**：利用 DARE 技术稀疏化各领域微调模型的增量参数，无需任何额外训练或 GPU 资源，即可直接将医疗、法律和代码模型的能力“吸收”到一个基座模型中。\n- **单模型高效部署**：成功将三个专用模型合并为单一的 `supermario` 版本模型，显著降低显存占用，实现单卡部署多能助手。\n- **保留专项特长**：通过参数平均与稀疏化处理，合并后的模型在各垂直领域的表现几乎无损，避免了传统混合训练带来的性能衰退。\n- 
**迭代速度飞跃**：新增领域能力时，只需微调一个新模型并再次合并，无需从头开始训练，极大缩短了产品迭代周期。\n\nMergeLM 让开发者像“超级马里奥”吃道具一样，以零训练成本免费获取同源模型的专长，彻底打破了多能力大模型落地的算力壁垒。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyule-BUAA_MergeLM_abc33f51.png","yule-BUAA","Yu Le","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fyule-BUAA_1cfdc66e.png","Qwen Team, Alibaba Group,\r\nPh.D. at Beihang University","Beihang University","Beijing",null,"YuLe57423534941","https:\u002F\u002Fyule-buaa.github.io\u002F","https:\u002F\u002Fgithub.com\u002Fyule-BUAA",[87],{"name":88,"color":89,"percentage":90},"Python","#3572A5",100,863,52,"2026-03-31T19:18:49","未说明","推理脚本支持 tensor_parallel_size 参数，暗示需要 GPU；具体型号和显存大小取决于所选模型（如 7B\u002F13B LLaMA），README 未明确指定最低要求。",{"notes":97,"python":94,"dependencies":98},"该工具核心亮点是合并同源模型无需重新训练或特定 GPU 即可吸收新能力（但在运行提供的推理\u002F合并脚本时通常仍需 GPU 加速）。代码支持 Encoder-based (BERT\u002FRoBERTa) 和 Decoder-based (LLaMA\u002FWizardLM) 模型。用户需自行下载数据集和模型，或通过修改 utils\u002Fload_config.py 中的 cache_dir 指定存储路径。部分功能依赖 vllm 进行高效推理。",[99,100,101,102,103,104,105],"torch==2.0.1","transformers==4.33.1","datasets==2.13.1","vllm==0.1.4","human_eval","numpy","tqdm",[26,13],"2026-03-27T02:49:30.150509","2026-04-06T11:31:10.462009",[110,115,120,125,130,135,140,144],{"id":111,"question_zh":112,"answer_zh":113,"source_url":114},16842,"如何合并多个大语言模型（如 Llama-7B\u002FVicuna）？支持哪些工具？","对于基于 Decoder 的大语言模型（LLM），当前仓库的实现可能需要较大的内存。推荐使用 [mergekit](https:\u002F\u002Fgithub.com\u002Farcee-ai\u002Fmergekit) 工具，它集成了 DARE 方法且内存效率更高。您可以先使用 mergekit 合并模型，然后使用本项目的评估代码进行测试。\n此外，作者已开源了优化后的新代码库 [MergeLLM](https:\u002F\u002Fgithub.com\u002Fyule-BUAA\u002FMergeLLM)，更方便使用并支持此类需求。\n注意：mergekit 在实现 DARE 时会自动使用 density（即 1 - drop rate）进行缩放，无需人为指定缩放比例；weight 参数对应 lambda，用于控制参数在合并中的重要性。","https:\u002F\u002Fgithub.com\u002Fyule-BUAA\u002FMergeLM\u002Fissues\u002F29",{"id":116,"question_zh":117,"answer_zh":118,"source_url":119},16843,"运行脚本时遇到 \"AssertionError: cannot find file trainer_state.json!\" 
错误怎么办？","该错误通常发生在推理或合并步骤中，因为脚本试图读取训练状态文件但未找到。这可能是因为您直接使用了预训练模型而未进行微调，或者训练输出目录配置不正确。\n如果是针对 LLM 的实验，请确认您的硬件环境（特别是 RAM）。当前实现的 LLM 模型合并需要大量内存，如果设备内存不足可能导致流程中断。建议检查训练日志确认是否成功保存了 checkpoint，或者考虑使用内存更高效的 mergekit 工具进行合并操作。","https:\u002F\u002Fgithub.com\u002Fyule-BUAA\u002FMergeLM\u002Fissues\u002F25",{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},16844,"项目是否提供已经合并好的模型权重下载？","是的，作者已上传合并后的 checkpoint 到百度网盘。由于指令遵循（instruction-following）和代码生成（code-generating）模型的 tokenizer 配置不同，分别存储了两个版本，但它们的模型参数是完全一致的。\n1. 指令遵循任务合并模型：\n链接：https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1thtOAGeHlCOZSFvcXgl6hQ\n提取码：zykq\n2. 代码生成任务合并模型：\n链接：https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1mkC3GobfqUbKXqTvY1QCzw\n提取码：ccu0","https:\u002F\u002Fgithub.com\u002Fyule-BUAA\u002FMergeLM\u002Fissues\u002F6",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},16845,"合并多个任务模型（如 instruct, math, code）时，为什么要保存三份模型？权重一样吗？","代码中针对三个任务保存三个模型主要是为了解决不同基座模型 tokenizer 不一致的问题。实际上，这三个模型的权重是完全一致的（仅经过一次合并操作得到），区别仅在于分别保存了 instruct、math 和 code 对应的 tokenizer。\n更优化的实现方式是仅保存一份模型权重，额外保存三个 tokenizer。作者最新的 [MergeLLM](https:\u002F\u002Fgithub.com\u002Fyule-BUAA\u002FMergeLLM) 项目已经采用了这种实现方式。","https:\u002F\u002Fgithub.com\u002Fyule-BUAA\u002FMergeLM\u002Fissues\u002F34",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},16846,"DARE 方法中的 Drop Rate（丢弃率）最大能设为多少？Rescale 的作用是什么？","Drop 的比例与 Delta 参数变化范围小于 0.002 的占比没有明确的固定关系。如果占比小于 70%，可以尝试依次递增 drop rate 来测试模型性能保持稳定的最大值。\nRescale（重缩放）对于 DARE 方法至关重要。如果不使用 rescale，大部分情况下当 drop rate 达到 0.3 或 0.4 时，模型性能就会显著下降（详见论文 4.4 节）。因此在使用高丢弃率时务必开启 `--use_weight_rescale` 参数。","https:\u002F\u002Fgithub.com\u002Fyule-BUAA\u002FMergeLM\u002Fissues\u002F30",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},16847,"合并命令中 scaling_coefficient、weight_mask_rate 等参数具体对应什么含义？","参数对应关系如下：\n1. `scaling_coefficient` 对应论文中的 scaling term（缩放系数）。\n2. `weight_mask_rate` 对应 DARE 方法中的 drop rate（丢弃率）。\n3. 
如果是 TIES-Merging 方法，retain ratio（保留率）对应 `1 - param_value_mask_rate`。\n\n示例命令（Task Arithmetic + DARE，drop rate 0.1, scaling 1.0）：\npython merge_llms_instruct_math_code.py --merge_instruct --merge_math --merge_code --merging_method_name mask_merging --mask_apply_method task_arithmetic --weight_mask_rate 0.1 --use_weight_rescale --mask_strategy random --scaling_coefficient 1.0 --tensor_parallel_size 4","https:\u002F\u002Fgithub.com\u002Fyule-BUAA\u002FMergeLM\u002Fissues\u002F39",{"id":141,"question_zh":142,"answer_zh":143,"source_url":129},16848,"为什么论文中 WizardLM-13B 在代码任务上的效果比 llama-2-13b-codealpaca 好？","可能的原因是 llama-2-13b-codealpaca 没有在代码任务上进行充分的微调，导致其在代码任务上的表现不佳（论文第 7 页有相关分析）。\n实验中选择 llama-2-13b-codealpaca 是因为它是少数基于 Llama-2-13B 进行 SFT 的开源代码模型，且与 WizardLM-13B\u002FWizardMath-13B 同源。虽然 WizardCoder-Python-13B 在代码上效果更好，但因基座模型不同源，未被选入此次模型合并实验。",{"id":145,"question_zh":146,"answer_zh":147,"source_url":148},16849,"合并两个模型时，是否需要预训练模型（PT Model）？Task Arithmetic 等方法是如何工作的？","这取决于使用的合并方法：\n1. Task Arithmetic 和 TIES-Merging 等方法：需要预训练模型（PT Model）。它们通过计算微调模型（SFT）与预训练模型（PT）之间的差值来定义“任务向量”（Task Vector），然后对这些任务向量进行操作。\n2. Average Merging（平均合并）等方法：不需要预训练模型，直接对模型权重进行平均。\n如果您想合并两个 Llama 模型，需根据所选算法决定是否提供原始的预训练底座模型。","https:\u002F\u002Fgithub.com\u002Fyule-BUAA\u002FMergeLM\u002Fissues\u002F5",[]]