[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-princeton-nlp--LLM-Shearing":3,"tool-princeton-nlp--LLM-Shearing":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",151314,2,"2026-04-11T23:32:58",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 
人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":76,"owner_website":77,"owner_url":78,"languages":79,"stars":92,"forks":93,"last_commit_at":94,"license":95,"difficulty_score":96,"env_os":97,"env_gpu":98,"env_ram":99,"env_deps":100,"category_tags":109,"github_topics":110,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":118,"updated_at":119,"faqs":120,"releases":146},6716,"princeton-nlp\u002FLLM-Shearing","LLM-Shearing","[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning","LLM-Shearing 是一项旨在加速大语言模型预训练的创新技术，其核心成果\"Sheared LLaMA\"曾入选 ICLR 2024。它主要解决了从头训练高性能小型语言模型成本高昂、耗时漫长的痛点。通过结构化剪枝（Structured Pruning）算法，LLM-Shearing 能够直接从成熟的超大模型（如 Llama-2-7B）中“剪切”出更小规模的模型，并辅以持续的预训练微调。\n\n研究表明，这种方法能以极低的计算成本获得卓越性能。例如，基于 Llama-2-7B 剪枝得到的模型，其能力可媲美从零开始训练的同类模型，但预训练成本仅为后者的 3%。该项目不仅开源了完整的剪枝与再训练代码库（基于 MosaicML Composer 构建），还直接提供了多个不同规模的高质量开源模型权重，包括基础版和指令微调版。\n\nLLM-Shearing 非常适合 AI 研究人员、大模型开发者以及希望高效部署轻量化模型的技术团队使用。对于资源有限但需要定制化小模型的机构而言，它提供了一条极具性价比的技术路径，让获取强大且紧凑的语言模型不再受限于昂贵的算力投入。","# 🦙 Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning\n\n🌟 [ArXiv Preprint](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06694) | [Blog Post](https:\u002F\u002Fxiamengzhou.github.io\u002Fsheared-llama\u002F)  \n\nBase models: [Sheared-LLaMA-1.3B](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FSheared-LLaMA-1.3B) | [Sheared-LLaMA-2.7B](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FSheared-LLaMA-2.7B) | [Sheared-Pythia-160m](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FSheared-Pythia-160m\u002Ftree\u002Fmain)  \nPruned Models without Continued Pre-training: [Sheared-LLaMA-1.3B-Pruned](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FSheared-LLaMA-1.3B-Pruned), [Sheared-LLaMA-2.7B-Pruned](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FSheared-LLaMA-2.7B-Pruned)  \nInstruction-tuned models: [Sheared-LLaMA-1.3B-ShareGPT](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FSheared-LLaMA-1.3B-ShareGPT) | [Sheared-LLaMA-2.7B-ShareGPT](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FSheared-LLaMA-2.7B-ShareGPT)\n\nThank you for your interest in our work! 
This is a joint work by [Mengzhou Xia](https:\u002F\u002Fxiamengzhou.github.io\u002F), [Tianyu Gao](https:\u002F\u002Fgaotianyu.xyz\u002Fabout\u002F), [Zhiyuan Zeng](https:\u002F\u002Fzhiyuan-zeng.github.io\u002F), and [Danqi Chen](https:\u002F\u002Fwww.cs.princeton.edu\u002F~danqic\u002F). Here, we provide our codebase for Sheared-LLaMA's pruning and continued pre-training algorithms :) We find that pruning strong base models is an extremely cost-effective way to get strong small-scale language models compared to pre-training them from scratch. The following graph shows that given the existence of Llama-2-7B model (pre-trained with 2T tokens), pruning it produces a model as strong as an OpenLLaMA model with 3% of its pre-training cost. \n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fprinceton-nlp_LLM-Shearing_readme_22ed03b2ca0e.jpg\" alt=\"teaser\" width=\"400\" \u002F>\n\n**Update**\n- [12\u002F19\u002F2023] Updated the [evaluation scripts](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing\u002Fblob\u002Fmaster\u002Ficl_eval) and pruning logs in the repo. \n- [11\u002F22\u002F2023] We released the instruction-tuned models [Sheared-LLaMA-1.3B-ShareGPT](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FSheared-LLaMA-1.3B-ShareGPT) and [Sheared-LLaMA-2.7B-ShareGPT](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FSheared-LLaMA-2.7B-ShareGPT).\n- [11\u002F19\u002F2023] We released the [Sheared-Pythia-160m](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FSheared-Pythia-160m) model developed at early stages. It was produced using the same shearing recipe and the Pile dataset. \n- [11\u002F05\u002F2023] We released the code on LLM-Shearing - excited to see it being applied to more models of different scales.\n- [10\u002F10\u002F2023] We released the Sheared-LLaMA paper, two Sheared LLaMA models and [tweeted about it](https:\u002F\u002Ftwitter.com\u002Fxiamengzhou\u002Fstatus\u002F1712102912439226510) 🚀!\n\n## 🔗 Quick Links\n- [🦙 Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning](#-sheared-llama-accelerating-language-model-pre-training-via-structured-pruning)\n  - [🔗 Quick Links](#-quick-links)\n  - [Brief Introduction](#brief-introduction)\n  - [Install Requirements](#install-requirements)\n  - [Data Preparation](#data-preparation)\n  - [Model Preparation](#model-preparation)\n  - [Sample Scripts for Pruning and Continued Pre-training](#sample-scripts-for-pruning-and-continued-pre-training)\n  - [Convert Pruned Model](#convert-pruned-model)\n  - [Convert Composer Model to Huggingface Model](#convert-composer-model-to-huggingface-model)\n  - [Training Configurations](#training-configurations)\n    - [Data configurations](#data-configurations)\n    - [Basic training configurations](#basic-training-configurations)\n    - [Pruning configurations](#pruning-configurations)\n    - [Dynamic batch loading configurations](#dynamic-batch-loading-configurations)\n  - [Throughput](#throughput)\n  - [Future Work](#future-work)\n  - [Bugs or Questions?](#bugs-or-questions)\n  - [Citation](#citation)\n\n\n##  Brief Introduction \nThis codebase is built based on MosaicML's amazing [Composer package](https:\u002F\u002Fgithub.com\u002Fmosaicml), which is specially designed and optimized for large language model pre-training. The entire implementation, including the `pruning` logic and the `dynamic batch loading` logic, are implemented as callback functions without touching the vanilla Composer trainer. 
Here's a concise overview of each folder within the codebase:\n- `shearing.data`: Contains sample data and scripts for data processing. \n- `shearing.datasets`: Implements customized datasets to enable dynamic data loading.\n- `shearing.callbacks`: Implements dynamic loading callbacks and pruning callbacks.\n- `shearing.models`: Implements the model files.\n- `shearing.scripts`: Contains scripts for running the code.\n- `shearing.utils`: Includes all utility functions, such as model conversion and pruning tests.\n- `train.py`: main entry of running the code\n\n\n\n\n## Install Requirements\n**Step 1**: To get started with this repository, you'll need to follow these installation steps. Before proceeding, make sure you have [Pytorch](https:\u002F\u002Fpytorch.org\u002Fget-started\u002Fprevious-versions\u002F) and [Flash Attention](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention)  installed. You can do this via pip using the following commands:\n```\npip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\npip install flash-attn==1.0.3.post\n```\nPlease note that Flash Attention version 2 is not currently supported and may require manual modifications to the model file. \n\n**Step 2**: Then install the rest of the required packages:\n```\ncd llmshearing\npip install -r requirement.txt\n```\n\n**Step 3**: Finally, install the `llmshearing` package in editable mode to make it accessible for your development environment:\n```\npip install -e .\n```\n\n\n## Data Preparation\nPlease refer to [llmshearing\u002Fdata](llmshearing\u002Fdata) for details on how to prepare data with Mosaicml's [Streaming](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming) package. \n\n## Model Preparation\nTo utilize Hugging Face transformer models with Composer, you'll need to convert the model weights to the key format expected by Composer. Here's an example of how to convert the weights from the Hugging Face model 'llama2' into a compatible format for Composer:\n```\n# Define the Hugging Face model name and the output path\nHF_MODEL_NAME=meta-llama\u002FLlama-2-7b-hf\nOUTPUT_PATH=models\u002FLlama-2-7b-composer\u002Fstate_dict.pt\n\n# Create the necessary directory if it doesn't exist\nmkdir -p $(dirname $OUTPUT_PATH)\n\n# Convert the Hugging Face model to Composer key format\npython3 -m llmshearing.utils.composer_to_hf save_hf_to_composer $HF_MODEL_NAME $OUTPUT_PATH\n```\n\nAdditionally, you can use the following utility function to test the equivalence between the Hugging Face model and the converted Composer model:\n```\nMODEL_SIZE=7B\npython3 -m llmshearing.utils.test_composer_hf_eq $HF_MODEL_NAME $OUTPUT_PATH $MODEL_SIZE\n```\n\nThese functions exclusively work for LLaMA\u002FLLaMA2 models. However, it should be straightforward to adapt them for use with other models such as Mistral-7B.\n\n\n## Sample Scripts for Pruning and Continued Pre-training\nFor pruning, you can reference an example script located in [`llmshearing\u002Fscripts\u002Fpruning.sh`](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing\u002Fblob\u002Fmaster\u002Fllmshearing\u002Fscripts\u002Fpruning.sh).  In this script, you will need to make adjustments to incorporate [data configurations](), [basic training configurations](), [pruning configurations]() and [dynamic batch loading configurations](). 
\n\nDue to the relatively higher computational cost of pruning compared to continued pre-training, we halt training with the pruning objective after a specific number of steps (typically 3200 steps in all our experiments). Subsequently, we proceed with further pre-training of the pruned model. To ensure compatibility, it is necessary to convert the state dictionary keys of the model to align with a standard target model structure. Detailed instructions for this conversion can be found at [Convert Pruned Model](#convert-pruned-model).\n\nAfter completing the model conversion, you can continue with the pre-training of the pruned model. The process is similar to pre-training a standard model. To do this, you can refer to an example script located at [`llmshearing\u002Fscripts\u002Fcontinue_pretraining.sh`](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing\u002Fblob\u002Fmaster\u002Fllmshearing\u002Fscripts\u002Fcontinue_pretraining.sh). In this script, the pruning configurations are eliminated.  \n\nAfter training the model, you can use the conversion script to convert the composer model into a transformers model. Please refer to the [Convert Composer Model to Huggingface Model](#convert-composer-model-to-huggingface-model) section for more details.\n\n## Convert Pruned Model\nFollowing the completion of training using [`llmshearing\u002Fscripts\u002Fpruning.sh`](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing\u002Fblob\u002Fmaster\u002Fllmshearing\u002Fscripts\u002Fpruning.sh), the saved models consist of the entire parameters of the source model, accompanied by a set of masks. We then act upon the masking variables by 1) removing the substructures where the masking variables are near $0$, and 2) subsuming the masking variables into the model parameters by matrix-vector multiplication, resulting in a more compact model. Simultaneously, it becomes necessary to rename the weight keys so that they can be seamlessly loaded into a target model architecture, ensuring that the layer names are all consecutive. \n\n```\nMODEL_PATH=$MODEL_DIR\u002Flatest-rank0.pt\npython3 -m llmshearing.utils.post_pruning_processing prune_and_save_model $MODEL_PATH\n```\n\nThe pruned model will be saved in `$(dirname $MODEL_PATH)\u002Fpruned-latest-rank0.pt`. \n\n## Convert Composer Model to Huggingface Model\nAfter training, if you'd like to use Hugging Face for inference or fine-tuning, you may opt to transform your composer model into a Hugging Face model using the [`llmshearing\u002Futils\u002Fcomposer_to_hf.py`](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing\u002Fblob\u002Fmaster\u002Fllmshearing\u002Futils\u002Fcomposer_to_hf.py) script. 
Here's an example of how to use the script:\n\n```\nMODEL_PATH=$MODEL_DIR\u002Flatest-rank0.pt\nOUTPUT_PATH=$MODEL_DIR\u002Fhf-latest_rank0\nMODEL_CLASS=LlamaForCausalLM\nHIDDEN_SIZE=2048\nNUM_ATTENTION_HEADS=16\nNUM_HIDDEN_LAYERS=24\nINTERMEDIATE_SIZE=5504\nMODEL_NAME=Sheared-Llama-1.3B\n\npython3 -m llmshearing.utils.composer_to_hf save_composer_to_hf $MODEL_PATH $OUTPUT_PATH \\\n        model_class=${MODEL_CLASS} \\\n        hidden_size=${HIDDEN_SIZE} \\\n        num_attention_heads=${NUM_ATTENTION_HEADS} \\\n        num_hidden_layers=${NUM_HIDDEN_LAYERS} \\\n        intermediate_size=${INTERMEDIATE_SIZE} \\\n        num_key_value_heads=${NUM_ATTENTION_HEADS} \\\n        _name_or_path=${MODEL_NAME}\n\n```\nPlease be aware that the parameter names mentioned here are tailored to Llama2's Hugging Face configurations and may differ when dealing with other model types.\n\n## Training Configurations\nIn this section, we provide an in-depth guide on configuring parameters within YAML configuration files for training. These configurations encompass several key aspects, including data setup, fundamental training settings, pruning settings, and dynamic data loading configurations.\n\n### Data configurations\n- `data_local`: The local directory containing the data. \n- `eval_loader.dataset.split`: For evaluation, provide the name of a combined split that includes data from all domains.\n- `train_loader.dataset.split`: When `dynamic=True` (please refer to the [dynamic loading section](#dynamic)) in the dynamic loading configuration, there's no need to set this value. However, if `dynamic=False`, you must specify a training split. \n\n### Basic training configurations\nThe basic training configurations largely follow the original `Composer` package. For comprehensive details on these configurations, please refer to  [Composer's official documentation](https:\u002F\u002Fdocs.mosaicml.com\u002Fprojects\u002Fcomposer\u002Fen\u002Fstable\u002F). Here are some key training parameters to take note of:\n\n- `max_duration`: This parameter defines the maximum training duration and can be specified in either the number of steps (e.g., `3200ba`) or epochs (e.g., `1ep`). In our experiments, the pruning duration was set to `3200ba`, and the continued pre-training duration was set to `48000ba`.\n- `save_interval`: This parameter determines how frequently the model state is saved. We set it to `3200ba` for both the pruning and continued pre-training stages..\n- `t_warmup`: This parameter specifies the duration of the learning rate warm-up for the learning rate scheduler. In the case of pruning, it is set to `320ba` ($10%$ of training), while for continued pre-training, it is set to 1440ba ($3%$ of training).\n- `optimizer.lr`: This parameter defines the learning rate for the primary model parameters, with the default value being `1e-4`.\n- `max_seq_len`: Following the Llama 2 training methodology, we accommodate a maximum sequence length of 4096.\n- `device_train_microbatch_size`: This parameter determines the batch size per device during training. For the pruning stage, we configure it to `4`, whereas for continued pre-training, it is set to `16`.\n- `global_train_batch_size`: This parameter specifies the global batch size across all GPUs during training. During the pruning stage, it is configured as `32`, while for continued pre-training, it is increased to `256`.\n- `autoresume`: This parameter can be enabled by setting it to `true` when resuming a run. 
However, it's important to note that while we have used it successfully during the continued pretraining stage, there is no guarantee of its compatibility with the pruning stage.\n\nDue to computational constraints, an exhaustive hyperparameter search was not conducted, and there may exist better hyper-parameters for improved performance.\n\n\n### Pruning configurations\nThe pruning process allows pruning a source model to a specific target shape, and the script includes essential parameters such as:\n\n- `from_model`: This parameter specifies the source model size and corresponds to a config_file.  \n- `to_model`: This parameter defines the target model size, and the source model will be pruned to match the target configuration.  \n- `optimizer.lag_lr`: This parameter specifies the learning rate to learn the masking variables and Lagrangian multipliers during pruning. The default value is $1.0$. \n\nThe pruning-specific arguments are all grouped under `model.l0_module`:  \n\n- `model.l0_module.lagrangian_warmup_steps`: \nIn the initial warm-up phase, the pruning rate incrementally rises from 0 to reach the desired target value. The specific target value is determined by the predefined structure of the target model. It's important to note that this value might differ from the warm-up steps associated with learning rates. Typically, we allocate approximately 20% of the total number of steps for this pruning warm-up process.   \n- `model.l0_module.pruning_modules`: By default, this setting prunes various aspects of the model, including the head, intermediate dimensions, hidden dimensions, and layers.  \n- `model.l0_module.eval_target_model`: When set to true, the evaluation process assesses a submodel that exactly matches the target model's structure. If set to false, the evaluation process considers the current model, taking into account the masking values. Since the mask may take some time to converge to the target model shape, we evaluate based on the current model shape rather than the target structure during training.  \n- `model.l0_module.target_model.d_model`: Specifies the hidden dimension of the target model.  \n- `model.l0_module.target_model.n_heads`: Specifies the number of heads in the target model.  \n- `model.l0_module.target_model.n_layers`: Specifies the number of layers in the target model.  \n- `model.l0_module.target_model.intermediate_size`: Specifies the number of intermediate dimensions in the target model.\n\nThese parameters allow you to configure and control the pruning process according to your specific requirements.\n\n### Dynamic batch loading configurations\nWe extend [Steaming's StreamingDataset](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fa94476b2f2d29f929a6963d774fd1a8d68efbab2\u002Fstreaming\u002Fbase\u002Fdataset.py#L170) in [datasets\u002Fstreaming_dataset.py](datasets\u002Fstreaming_dataset.py) to support loading data dynamically. The parameters for configuring dynamic batch loading are primarily defined within the `DynamicLoadingCallback`. Most of the following configurations can be specified in a YAML configuration file under the `callbacks.data_loading` section. Here's an explanation of each parameter:\n\n- `callbacks.data_loading.dynamic`: This boolean parameter determines whether dynamic data loading is enabled. When set to true, data is loaded dynamically from various domains or streams. 
If set to false, dynamic data loading is disabled.\n- `callbacks.data_loading.set_names`: Specify the domain names or stream names that will be used for dynamic data loading.\n- `callbacks.data_loading.proportion`: This parameter defines the initial data loading proportion for each domain or stream. The sum of all proportions must equal 1, indicating the relative weights of each source in the initial data loading configuration.\n- `callbacks.data_loading.update_type`: Choose the update type for adjusting the data loading proportions during training. There are two options:\n    - `doremi`: In this mode, the data loading proportions are updated using an exponential descent approach, similar to the method described in [Doremi](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.10429). This allows for adaptive adjustment of data loading proportions over time.\n    - `constant`: Selecting this option keeps the data loading proportions constant throughout training. It's equivalent to disabling dynamic data loading.\n- `callbacks.data_loading.target_loss`: Specify the target validation loss for the training process. This target loss value should be calculated or predetermined before training begins. The loading proportions will be dynamically adjusted based on the difference between the model's current loss and the target loss. This adjustment helps guide the training process towards the desired performance level. \n- `eval_interval`: Determine how often evaluations are performed during training. If `dynamic=True`, the data loading proportion will be adjusted after each evaluation.\n\nThe code is designed to exclusively accommodate local data and does not support remote streaming data. Additionally, it currently only functions with a single worker for the dataloader and does not offer prefetch support. In our testing, this restriction does not incur any additional compute overhead.\n\n## Throughput\nHere is the throughput of running the pruning and continued pre-training steps with A100 80GB GPUs. The throughput is quantified in terms of tokens processed per second. Please refer to the standard throughput of [llm-foundry](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fllm-foundry\u002Fblob\u002Fdd15791818fa53ae792de66d3529d94e0dcb83d9\u002Fscripts\u002Ftrain\u002Fbenchmarking\u002FREADME.md#a100-80gb-with-1600-gbps-node-node-interconnect-roce).\n\n|           | GPUs           | Throughput per Device | Throughput   |\n|-----------|----------------|------------------------|--------------|\n| Pruning 7B| 8              | 1844                   | 14750        |\n| Pre-training 3B | 16       | 4957                   | 79306        |\n| Pre-training 1.3B | 16     | 8684                   | 138945       |\n\n\n## Future Work\n\n**Source models**: While large models are undoubtedly powerful and have the potential to become stronger in the near future, we believe that small-scale models (those with fewer than 7 billion parameters) have untapped potential. However, there is little effort dedicated to making small models stronger, and our work pushes towards this goal. A natural extension of this work is to extend the codebase to prune\n- Stronger base models, such as [Mistral-7B](https:\u002F\u002Fmistral.ai\u002Fnews\u002Fannouncing-mistral-7b\u002F)\n- Domain-specific language models such as code base models, including [CodeLlama](https:\u002F\u002Fhuggingface.co\u002Fcodellama), and [DeepSeek-Coder](https:\u002F\u002Fhuggingface.co\u002Fcodellama)\n- Models from different scales. 
We mainly worked with 7B models due to computational constraints. It's unclear if pruning from larger models will be more beneficial.\n\nTo adapt the codebase to other models, one key component is to make sure that running the model with masks is equivalent to running the pruned model. We use [llmshearing\u002Futils\u002Ftest_pruning.py](llmshearing\u002Futils\u002Ftest_pruning.py) to run such tests to ensure the correctness of the function `prune_params` in model files. \n\n**Data Sources**: Please keep in mind that the performance of the resulting model is contingent not only on the pruning algorithm and the base model but also on the quality of the data. In our experiments, we mainly worked the [RedPajama v1 data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftogethercomputer\u002FRedPajama-Data-1T). However, here are some additional resources that could be considered for inclusion:\n\n- [Dolma data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Fdolma), a 3T pre-training dataset including domains of CommonCrawl, C4, peS2o, The Stack, Project Gutenberg and Wikipedia\n- [proof-pile-2](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FEleutherAI\u002Fproof-pile-2), a 55 billion token dataset of mathematical and scientific documents.\n- [RedPajama-v2](https:\u002F\u002Ftogether.ai\u002Fblog\u002Fredpajama-data-v2), a 30T token pre-training dataset.\n\n\n\n## Bugs or Questions?\nIf you have any questions related to the code or the paper, feel free to email Mengzhou (mengzhou@princeton.edu). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to specify the problem with details so we can help you better and quicker!\n\n\n## Citation\nPlease cite our paper if you find the repo helpful in your work:\n\n```bibtex\n@article{xia2023sheared,\n  title={Sheared llama: Accelerating language model pre-training via structured pruning},\n  author={Xia, Mengzhou and Gao, Tianyu and Zeng, Zhiyuan and Chen, Danqi},\n  journal={arXiv preprint arXiv:2310.06694},\n  year={2023}\n}\n```\n\n\n","# 🦙 剪枝版 LLaMA：通过结构化剪枝加速语言模型预训练\n\n🌟 [ArXiv 预印本](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06694) | [博客文章](https:\u002F\u002Fxiamengzhou.github.io\u002Fsheared-llama\u002F)  \n\n基础模型：[Sheared-LLaMA-1.3B](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FSheared-LLaMA-1.3B) | [Sheared-LLaMA-2.7B](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FSheared-LLaMA-2.7B) | [Sheared-Pythia-160m](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FSheared-Pythia-160m\u002Ftree\u002Fmain)  \n未进行持续预训练的剪枝模型：[Sheared-LLaMA-1.3B-Pruned](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FSheared-LLaMA-1.3B-Pruned), [Sheared-LLaMA-2.7B-Pruned](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FSheared-LLaMA-2.7B-Pruned)  \n指令微调模型：[Sheared-LLaMA-1.3B-ShareGPT](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FSheared-LLaMA-1.3B-ShareGPT) | [Sheared-LLaMA-2.7B-ShareGPT](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FSheared-LLaMA-2.7B-ShareGPT)\n\n感谢您对我们工作的关注！本项目由 [Mengzhou Xia](https:\u002F\u002Fxiamengzhou.github.io\u002F)、[Tianyu Gao](https:\u002F\u002Fgaotianyu.xyz\u002Fabout\u002F)、[Zhiyuan Zeng](https:\u002F\u002Fzhiyuan-zeng.github.io\u002F) 和 [Danqi Chen](https:\u002F\u002Fwww.cs.princeton.edu\u002F~danqic\u002F) 共同完成。在此，我们提供了 Sheared-LLaMA 的剪枝与持续预训练算法代码库 :) 我们发现，相较于从头开始预训练小型语言模型，对强大的基础模型进行剪枝是一种极具成本效益的方法。下图显示，在已有 Llama-2-7B 模型（使用 2T tokens 预训练）的情况下，对其进行剪枝后得到的模型性能可媲美 OpenLLaMA 模型，而其预训练成本仅为后者的 
3%。\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fprinceton-nlp_LLM-Shearing_readme_22ed03b2ca0e.jpg\" alt=\"teaser\" width=\"400\" \u002F>\n\n**更新**\n- [2023年12月19日] 更新了仓库中的 [评估脚本](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing\u002Fblob\u002Fmaster\u002Ficl_eval) 和剪枝日志。\n- [2023年11月22日] 我们发布了指令微调模型 [Sheared-LLaMA-1.3B-ShareGPT](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FSheared-LLaMA-1.3B-ShareGPT) 和 [Sheared-LLaMA-2.7B-ShareGPT](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FSheared-LLaMA-2.7B-ShareGPT)。\n- [2023年11月19日] 我们发布了早期阶段开发的 [Sheared-Pythia-160m](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FSheared-Pythia-160m) 模型。该模型采用相同的剪枝配方和 Pile 数据集生成。\n- [2023年11月5日] 我们开源了 LLM-Shearing 的代码——很高兴看到它被应用于更多不同规模的模型。\n- [2023年10月10日] 我们发表了 Sheared-LLaMA 论文，并发布了两款 Sheared LLaMA 模型，同时在 Twitter 上进行了分享 🚀！\n\n## 🔗 快速链接\n- [🦙 剪枝版 LLaMA：通过结构化剪枝加速语言模型预训练](#-sheared-llama-accelerating-language-model-pre-training-via-structured-pruning)\n  - [🔗 快速链接](#-quick-links)\n  - [简要介绍](#brief-introduction)\n  - [安装要求](#install-requirements)\n  - [数据准备](#data-preparation)\n  - [模型准备](#model-preparation)\n  - [剪枝与持续预训练示例脚本](#sample-scripts-for-pruning-and-continued-pre-training)\n  - [转换剪枝模型](#convert-pruned-model)\n  - [将 Composer 模型转换为 Hugging Face 模型](#convert-composer-model-to-huggingface-model)\n  - [训练配置](#training-configurations)\n    - [数据配置](#data-configurations)\n    - [基础训练配置](#basic-training-configurations)\n    - [剪枝配置](#pruning-configurations)\n    - [动态批量加载配置](#dynamic-batch-loading-configurations)\n  - [吞吐量](#throughput)\n  - [未来工作](#future-work)\n  - [遇到问题或有疑问？](#bugs-or-questions)\n  - [引用](#citation)\n\n\n## 简要介绍 \n本代码库基于 MosaicML 令人惊叹的 [Composer 包](https:\u002F\u002Fgithub.com\u002Fmosaicml)，该包专为大规模语言模型预训练设计并进行了优化。整个实现，包括 `剪枝` 逻辑和 `动态批量加载` 逻辑，均以回调函数的形式实现，无需修改原生 Composer 训练器。以下是代码库中各文件夹的简要概述：\n- `shearing.data`: 包含示例数据和数据处理脚本。\n- `shearing.datasets`: 实现自定义数据集，以支持动态数据加载。\n- `shearing.callbacks`: 实现动态加载回调函数和剪枝回调函数。\n- `shearing.models`: 实现模型文件。\n- `shearing.scripts`: 包含运行代码的脚本。\n- `shearing.utils`: 包括所有实用工具函数，例如模型转换和剪枝测试。\n- `train.py`: 运行代码的主要入口\n\n\n\n\n## 安装要求\n**步骤 1**：要开始使用此仓库，您需要按照以下安装步骤操作。在继续之前，请确保已安装 [Pytorch](https:\u002F\u002Fpytorch.org\u002Fget-started\u002Fprevious-versions\u002F) 和 [Flash Attention](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention)。您可以通过以下 pip 命令完成安装：\n```\npip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\npip install flash-attn==1.0.3.post\n```\n请注意，目前不支持 Flash Attention 2 版本，可能需要手动修改模型文件。\n\n**步骤 2**：然后安装其余所需软件包：\n```\ncd llmshearing\npip install -r requirement.txt\n```\n\n**步骤 3**：最后，以可编辑模式安装 `llmshearing` 包，以便在您的开发环境中使用：\n```\npip install -e .\n```\n\n\n## 数据准备\n有关如何使用 Mosaicml 的 [Streaming](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming) 包准备数据的详细信息，请参阅 [llmshearing\u002Fdata](llmshearing\u002Fdata)。\n\n## 模型准备\n要将 Hugging Face 转换器模型与 Composer 结合使用，您需要将模型权重转换为 Composer 所期望的密钥格式。以下是将 Hugging Face 模型 'llama2' 的权重转换为 Composer 兼容格式的示例：\n```\n# 定义 Hugging Face 模型名称和输出路径\nHF_MODEL_NAME=meta-llama\u002FLlama-2-7b-hf\nOUTPUT_PATH=models\u002FLlama-2-7b-composer\u002Fstate_dict.pt\n\n# 如果目录不存在，则创建\nmkdir -p $(dirname $OUTPUT_PATH)\n\n# 将 Hugging Face 模型转换为 Composer 密钥格式\npython3 -m llmshearing.utils.composer_to_hf save_hf_to_composer $HF_MODEL_NAME $OUTPUT_PATH\n```\n\n此外，您可以使用以下实用函数来测试 Hugging Face 模型与转换后的 Composer 模型之间的等效性：\n```\nMODEL_SIZE=7B\npython3 -m 
llmshearing.utils.test_composer_hf_eq $HF_MODEL_NAME $OUTPUT_PATH $MODEL_SIZE\n```\n\n这些函数仅适用于 LLaMA\u002FLLaMA2 模型。不过，将其适配到其他模型（如 Mistral-7B）应该并不困难。\n\n## 剪枝与继续预训练的示例脚本\n对于剪枝，您可以参考位于 [`llmshearing\u002Fscripts\u002Fpruning.sh`](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing\u002Fblob\u002Fmaster\u002Fllmshearing\u002Fscripts\u002Fpruning.sh) 的示例脚本。在该脚本中，您需要进行调整以纳入 [数据配置]()、[基础训练配置]()、[剪枝配置]() 和 [动态批量加载配置]()。\n\n由于剪枝相比继续预训练的计算成本相对较高，我们会在特定步数后（通常在所有实验中为 3200 步）停止使用剪枝目标的训练。随后，我们将对剪枝后的模型进行进一步的预训练。为确保兼容性，有必要将模型的状态字典键转换为符合标准目标模型结构的形式。有关此转换的详细说明，请参阅 [转换剪枝模型](#convert-pruned-model)。\n\n完成模型转换后，您可以继续对剪枝模型进行预训练。这一过程与预训练标准模型类似。为此，您可以参考位于 [`llmshearing\u002Fscripts\u002Fcontinue_pretraining.sh`]((https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing\u002Fblob\u002Fmaster\u002Fllmshearing\u002Fscripts\u002Fcontinue_pretraining.sh)) 的示例脚本。在此脚本中，剪枝配置已被移除。\n\n训练完成后，您可以使用转换脚本将 Composer 模型转换为 Transformers 模型。更多详情请参阅 [将 Composer 模型转换为 Hugging Face 模型](#convert-composer-model-to-huggingface-model) 部分。\n\n## 转换剪枝模型\n在使用 [`llmshearing\u002Fscripts\u002Fpruning.sh`](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing\u002Fblob\u002Fmaster\u002Fllmshearing\u002Fscripts\u002Fpruning.sh) 完成训练后，保存的模型包含源模型的全部参数，并附带一组掩码。接下来，我们将对掩码变量采取以下操作：1) 移除掩码变量接近于 0 的子结构；2) 通过矩阵-向量乘法将掩码变量合并到模型参数中，从而得到一个更为紧凑的模型。同时，还需要重命名权重键，以便能够无缝加载到目标模型架构中，确保各层名称连续一致。\n\n```\nMODEL_PATH=$MODEL_DIR\u002Flatest-rank0.pt\npython3 -m llmshearing.utils.post_pruning_processing prune_and_save_model $MODEL_PATH\n```\n\n剪枝后的模型将保存在 `$(dirname $MODEL_PATH)\u002Fpruned-latest-rank0.pt` 中。\n\n## 将 Composer 模型转换为 Hugging Face 模型\n训练完成后，如果您希望使用 Hugging Face 进行推理或微调，可以使用 [`llmshearing\u002Fscripts\u002Fcomposer_to_hf.py`](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing\u002Fblob\u002Fmaster\u002Fllmshearing\u002Futils\u002Fcomposer_to_hf.py) 脚本将您的 Composer 模型转换为 Hugging Face 模型。以下是该脚本的使用示例：\n\n```\nMODEL_PATH=$MODEL_DIR\u002Flatest-rank0.pt\nOUTPUT_PATH=$MODEL_DIR\u002Fhf-latest_rank0\nMODEL_CLASS=LlamaForCausalLM\nHIDDEN_SIZE=2048\nNUM_ATTENTION_HEADS=16\nNUM_HIDDEN_LAYERS=24\nINTERMEDIATE_SIZE=5504\nMODEL_NAME=Sheared-Llama-1.3B\n\npython3 -m llmshearing.utils.composer_to_hf save_composer_to_hf $MODEL_PATH $OUTPUT_PATH \\\n        model_class=${MODEL_CLASS} \\\n        hidden_size=${HIDDEN_SIZE} \\\n        num_attention_heads=${NUM_ATTENTION_HEADS} \\\n        num_hidden_layers=${NUM_HIDDEN_LAYERS} \\\n        intermediate_size=${INTERMEDIATE_SIZE} \\\n        num_key_value_heads=${NUM_ATTENTION_HEADS} \\\n        _name_or_path=${MODEL_NAME}\n```\n\n请注意，此处提到的参数名称是针对 Llama2 的 Hugging Face 配置设计的，在处理其他类型模型时可能会有所不同。\n\n## 训练配置\n本节将深入介绍如何在 YAML 配置文件中设置训练参数。这些配置涵盖多个关键方面，包括数据设置、基础训练设置、剪枝设置以及动态数据加载配置。\n\n### 数据配置\n- `data_local`: 包含数据的本地目录。\n- `eval_loader.dataset.split`: 对于评估，需提供一个包含所有领域数据的组合分割名称。\n- `train_loader.dataset.split`: 当动态加载配置中的 `dynamic=True`（请参阅 [动态加载部分](#dynamic)）时，无需设置此值。然而，若 `dynamic=False`，则必须指定一个训练分割。\n\n### 基础训练配置\n基础训练配置主要遵循原始的 `Composer` 包。有关这些配置的详细信息，请参阅 [Composer 官方文档](https:\u002F\u002Fdocs.mosaicml.com\u002Fprojects\u002Fcomposer\u002Fen\u002Fstable\u002F)。以下是一些需要注意的关键训练参数：\n\n- `max_duration`: 此参数定义了最大训练时长，可按步数（例如 `3200ba`）或轮次（例如 `1ep`）指定。在我们的实验中，剪枝阶段的训练时长设为 `3200ba`，而继续预训练阶段则设为 `48000ba`。\n- `save_interval`: 此参数决定了模型状态保存的频率。我们在剪枝和继续预训练阶段均将其设置为 `3200ba`。\n- `t_warmup`: 此参数指定了学习率调度器的学习率预热时长。在剪枝阶段，其值设为 `320ba`（占总训练时间的 10%），而在继续预训练阶段则设为 1440ba（占总训练时间的 3%）。\n- `optimizer.lr`: 此参数定义了主模型参数的学习率，默认值为 `1e-4`。\n- `max_seq_len`: 按照 Llama 2 的训练方法，我们支持的最大序列长度为 4096。\n- 
`device_train_microbatch_size`: 此参数决定了每台设备在训练过程中的批次大小。在剪枝阶段，我们将其设置为 `4`，而在继续预训练阶段则提高至 `16`。\n- `global_train_batch_size`: 此参数指定了训练过程中所有 GPU 上的全局批次大小。在剪枝阶段，其值设为 `32`，而在继续预训练阶段则增加至 `256`。\n- `autoresume`: 此参数可通过将其设置为 `true` 来启用断点续跑功能。然而，需要注意的是，尽管我们在继续预训练阶段成功使用过此功能，但尚不能保证其在剪枝阶段的兼容性。\n\n由于计算资源的限制，我们并未进行全面的超参数搜索，因此可能存在更优的超参数以提升性能。\n\n### 剪枝配置\n剪枝过程允许将源模型裁剪为特定的目标结构，脚本中包含一些关键参数，例如：\n\n- `from_model`：该参数指定源模型的大小，并与一个配置文件相对应。\n- `to_model`：该参数定义目标模型的大小，源模型将被剪枝以匹配目标配置。\n- `optimizer.lag_lr`：该参数指定在剪枝过程中用于学习掩码变量和拉格朗日乘子的学习率。默认值为 $1.0$。\n\n剪枝相关的参数都归类在 `model.l0_module` 下：\n\n- `model.l0_module.lagrangian_warmup_steps`：\n在初始预热阶段，剪枝率会从 0 逐步增加到期望的目标值。具体的目标值由目标模型的预定义结构决定。需要注意的是，这个值可能与学习率的预热步数不同。通常，我们会将总步数的约 20% 用于剪枝的预热过程。\n- `model.l0_module.pruning_modules`：默认情况下，此设置会剪枝模型的各个部分，包括头部、中间维度、隐藏维度和层。\n- `model.l0_module.eval_target_model`：当设置为 true 时，评估过程会针对与目标模型结构完全一致的子模型进行评估。如果设置为 false，则评估当前模型，并考虑掩码值。由于掩码可能需要一段时间才能收敛为目标模型形状，因此在训练期间我们基于当前模型形状而非目标结构进行评估。\n- `model.l0_module.target_model.d_model`：指定目标模型的隐藏维度。\n- `model.l0_module.target_model.n_heads`：指定目标模型的注意力头数。\n- `model.l0_module.target_model.n_layers`：指定目标模型的层数。\n- `model.l0_module.target_model.intermediate_size`：指定目标模型的中间维度数量。\n\n这些参数使您可以根据具体需求配置和控制剪枝过程。\n\n### 动态批量加载配置\n我们在 [datasets\u002Fstreaming_dataset.py](datasets\u002Fstreaming_dataset.py) 中扩展了 [Steaming 的 StreamingDataset](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fa94476b2f2d29f929a6963d774fd1a8d68efbab2\u002Fstreaming\u002Fbase\u002Fdataset.py#L170)，以支持动态加载数据。配置动态批量加载的参数主要定义在 `DynamicLoadingCallback` 中。以下大多数配置都可以在 YAML 配置文件的 `callbacks.data_loading` 部分中指定。以下是每个参数的说明：\n\n- `callbacks.data_loading.dynamic`：此布尔参数决定是否启用动态数据加载。当设置为 true 时，数据会从不同的领域或流中动态加载。如果设置为 false，则禁用动态数据加载。\n- `callbacks.data_loading.set_names`：指定用于动态数据加载的领域名称或流名称。\n- `callbacks.data_loading.proportion`：该参数定义每个领域或流的初始数据加载比例。所有比例之和必须等于 1，表示初始数据加载配置中各来源的相对权重。\n- `callbacks.data_loading.update_type`：选择在训练过程中调整数据加载比例的更新方式。有两种选项：\n    - `doremi`：在此模式下，数据加载比例会采用指数下降的方式进行更新，类似于 [Doremi](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.10429) 中描述的方法。这使得数据加载比例能够随时间自适应地调整。\n    - `constant`：选择此选项后，数据加载比例在整个训练过程中保持不变。相当于禁用了动态数据加载。\n- `callbacks.data_loading.target_loss`：指定训练过程中的目标验证损失。该目标损失值应在训练开始前计算或预先确定。加载比例会根据模型当前损失与目标损失之间的差异进行动态调整，从而引导训练过程达到预期的性能水平。\n- `eval_interval`：确定训练过程中评估的频率。如果 `dynamic=True`，则每次评估后都会调整数据加载比例。\n\n该代码设计仅适用于本地数据，不支持远程流式数据。此外，目前它仅支持单个工作线程的数据加载器，且不提供预取功能。在我们的测试中，这一限制并未带来额外的计算开销。\n\n## 吞吐量\n以下是使用 A100 80GB GPU 运行剪枝和继续预训练步骤的吞吐量。吞吐量以每秒处理的标记数来衡量。请参考 [llm-foundry](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fllm-foundry\u002Fblob\u002Fdd15791818fa53ae792de66d3529d94e0dcb83d9\u002Fscripts\u002Ftrain\u002Fbenchmarking\u002FREADME.md#a100-80gb-with-1600-gbps-node-node-interconnect-roce) 的标准吞吐量。\n\n|           | GPU 数量   | 每设备吞吐量 | 总吞吐量 |\n|-----------|------------|--------------|----------|\n| 剪枝 7B  | 8          | 1844         | 14750    |\n| 预训练 3B | 16        | 4957         | 79306    |\n| 预训练 1.3B | 16     | 8684         | 138945   |\n\n## 未来工作\n\n**源模型**：尽管大型模型无疑功能强大，并且在不久的将来还有进一步增强的潜力，但我们认为小规模模型（参数量少于70亿）仍具有未被充分挖掘的潜力。然而，目前鲜有研究致力于提升小模型的能力，而我们的工作正是朝着这一目标迈进。这项工作的自然延伸是将代码库扩展到以下方面：\n- 更强大的基础模型，例如 [Mistral-7B](https:\u002F\u002Fmistral.ai\u002Fnews\u002Fannouncing-mistral-7b\u002F)；\n- 领域特定的语言模型，如代码生成模型，包括 [CodeLlama](https:\u002F\u002Fhuggingface.co\u002Fcodellama) 和 [DeepSeek-Coder](https:\u002F\u002Fhuggingface.co\u002Fcodellama)；\n- 不同规模的模型。由于计算资源的限制，我们主要使用了70亿参数的模型进行实验。尚不清楚从更大规模的模型中进行剪枝是否会有更大的收益。\n\n为了使代码库适配其他模型，一个关键步骤是确保使用掩码运行模型与运行剪枝后的模型效果等价。我们使用 
[llmshearing\u002Futils\u002Ftest_pruning.py](llmshearing\u002Futils\u002Ftest_pruning.py) 来执行此类测试，以验证模型文件中 `prune_params` 函数的正确性。\n\n**数据来源**：请务必注意，最终模型的性能不仅取决于剪枝算法和基础模型，还取决于数据的质量。在我们的实验中，主要使用了 [RedPajama v1 数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftogethercomputer\u002FRedPajama-Data-1T)。不过，以下是一些可以考虑纳入的额外资源：\n- [Dolma 数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Fdolma)，这是一个3TB的预训练数据集，涵盖了 CommonCrawl、C4、peS2o、The Stack、Project Gutenberg 和 Wikipedia 等多个领域；\n- [proof-pile-2 数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FEleutherAI\u002Fproof-pile-2)，包含550亿个标记的数学和科学文献；\n- [RedPajama-v2 数据集](https:\u002F\u002Ftogether.ai\u002Fblog\u002Fredpajama-data-v2)，一个30TB的预训练数据集。\n\n\n\n## 发现错误或有疑问？\n如果您对代码或论文有任何疑问，请随时发送邮件至 Mengzhou (mengzhou@princeton.edu)。如果在使用代码时遇到任何问题，或希望报告错误，您可以提交一个问题。请尽可能详细地描述问题，以便我们能够更快更好地帮助您！\n\n\n## 引用\n如果您在工作中发现本仓库有所帮助，请引用我们的论文：\n\n```bibtex\n@article{xia2023sheared,\n  title={Sheared llama: 通过结构化剪枝加速语言模型预训练},\n  author={Xia, Mengzhou and Gao, Tianyu and Zeng, Zhiyuan and Chen, Danqi},\n  journal={arXiv 预印本 arXiv:2310.06694},\n  year={2023}\n}\n```","# LLM-Shearing 快速上手指南\n\nLLM-Shearing 是一个通过结构化剪枝（Structured Pruning）加速大语言模型预训练的工具。它能够将大型基座模型（如 LLaMA-2-7B）剪枝为小型模型，并通过少量继续预训练达到与从头训练相当的效果，显著降低计算成本。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**: Linux (推荐 Ubuntu)\n*   **Python**: 3.8+\n*   **GPU**: 支持 CUDA 的 NVIDIA GPU\n*   **前置依赖**:\n    *   PyTorch (需指定版本以兼容 Flash Attention)\n    *   Flash Attention (注意：目前仅支持 v1 版本，v2 版本需要手动修改代码)\n\n## 安装步骤\n\n请按照以下顺序执行命令来配置环境。\n\n### 1. 安装 PyTorch 和 Flash Attention\n首先安装特定版本的 PyTorch 和 Flash Attention v1。以下命令基于 CUDA 11.8，请根据您的实际 CUDA 版本调整。\n\n```bash\npip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\npip install flash-attn==1.0.3.post\n```\n\n> **提示**：国内用户若下载缓慢，可尝试使用清华或阿里镜像源替换 `--index-url` 或直接使用 `pip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple ...`（需注意兼容性）。\n\n### 2. 克隆仓库并安装依赖\n克隆项目代码并安装 `requirements.txt` 中的其他依赖包。\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing.git\ncd LLM-Shearing\npip install -r requirement.txt\n```\n\n### 3. 安装本地开发包\n以可编辑模式安装 `llmshearing` 包，以便在开发环境中调用。\n\n```bash\npip install -e .\n```\n\n## 基本使用\n\n本工具的核心流程分为三个阶段：**模型格式转换** -> **剪枝与继续预训练** -> **模型导出**。\n\n### 1. 模型准备（格式转换）\nLLM-Shearing 基于 MosaicML Composer 构建，使用前需将 Hugging Face 格式的模型权重转换为 Composer 格式。以下以 `Llama-2-7b` 为例：\n\n```bash\n# 定义变量\nHF_MODEL_NAME=meta-llama\u002FLlama-2-7b-hf\nOUTPUT_PATH=models\u002FLlama-2-7b-composer\u002Fstate_dict.pt\n\n# 创建目录\nmkdir -p $(dirname $OUTPUT_PATH)\n\n# 执行转换\npython3 -m llmshearing.utils.composer_to_hf save_hf_to_composer $HF_MODEL_NAME $OUTPUT_PATH\n```\n\n*(可选) 验证转换后的模型与原模型等价性：*\n```bash\nMODEL_SIZE=7B\npython3 -m llmshearing.utils.test_composer_hf_eq $HF_MODEL_NAME $OUTPUT_PATH $MODEL_SIZE\n```\n\n### 2. 
执行剪枝与继续预训练\n项目提供了示例脚本用于执行剪枝和后续的继续预训练。您需要根据实际需求修改脚本中的配置文件（数据路径、剪枝参数等）。\n\n*   **剪枝阶段**：运行 [`llmshearing\u002Fscripts\u002Fpruning.sh`](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing\u002Fblob\u002Fmaster\u002Fllmshearing\u002Fscripts\u002Fpruning.sh)。\n    *   该阶段通常会运行固定步数（例如 3200 步），目的是移除冗余结构。\n*   **继续预训练阶段**：剪枝完成后，运行 [`llmshearing\u002Fscripts\u002Fcontinue_pretraining.sh`](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing\u002Fblob\u002Fmaster\u002Fllmshearing\u002Fscripts\u002Fcontinue_pretraining.sh) 对剪枝后的模型进行进一步训练以恢复性能。\n\n> **注意**：请务必在脚本中正确配置 `data_local`（数据本地路径）及相关训练超参数。数据准备需参考 MosaicML Streaming 格式。\n\n### 3. 处理剪枝后的模型\n剪枝训练结束后，保存的模型包含原始参数和掩码（masks）。需要运行后处理脚本将掩码应用到位，移除零值子结构，并重整权重键名以适配标准架构。\n\n```bash\nMODEL_PATH=$MODEL_DIR\u002Flatest-rank0.pt\npython3 -m llmshearing.utils.post_pruning_processing prune_and_save_model $MODEL_PATH\n```\n处理后的模型将保存为 `pruned-latest-rank0.pt`。\n\n### 4. 导出为 Hugging Face 格式\n为了方便推理或微调，最后将 Composer 格式的模型转换回 Hugging Face Transformers 格式。\n\n```bash\nMODEL_PATH=$MODEL_DIR\u002Flatest-rank0.pt\nOUTPUT_PATH=$MODEL_DIR\u002Fhf-latest_rank0\nMODEL_CLASS=LlamaForCausalLM\nHIDDEN_SIZE=2048\nNUM_ATTENTION_HEADS=16\nNUM_HIDDEN_LAYERS=24\nINTERMEDIATE_SIZE=5504\nMODEL_NAME=Sheared-Llama-1.3B\n\npython3 -m llmshearing.utils.composer_to_hf save_composer_to_hf $MODEL_PATH $OUTPUT_PATH \\\n        model_class=${MODEL_CLASS} \\\n        hidden_size=${HIDDEN_SIZE} \\\n        num_attention_heads=${NUM_ATTENTION_HEADS} \\\n        num_hidden_layers=${NUM_HIDDEN_LAYERS} \\\n        intermediate_size=${INTERMEDIATE_SIZE} \\\n        num_key_value_heads=${NUM_ATTENTION_HEADS} \\\n        _name_or_path=${MODEL_NAME}\n```\n\n转换完成后，即可在 `$OUTPUT_PATH` 目录下使用标准的 Hugging Face `transformers` 库加载模型进行推理。","某初创团队希望基于 Llama-2-7B 的强大能力，快速定制一款专用于法律文档分析的 1.3B 参数小模型，以部署在资源受限的本地服务器上。\n\n### 没有 LLM-Shearing 时\n- **训练成本高昂**：从头预训练一个 1.3B 模型需要消耗数万 GPU 小时，预算直接超支，且耗时数月。\n- **性能差距明显**：同等参数量下，从零训练的小模型在专业领域的理解力远不如大模型，难以满足法律场景的准确性要求。\n- **资源门槛过高**：团队缺乏千卡集群资源，无法承担大规模预训练所需的算力和电力开销。\n- **迭代周期漫长**：一旦模型效果不佳，重新调整架构或数据再训练的成本极高，导致产品上线遥遥无期。\n\n### 使用 LLM-Shearing 后\n- **成本降低 97%**：直接对现有的 Llama-2-7B 进行结构化剪枝并继续预训练，仅需原本 3% 的计算成本即可获得同等性能的 1.3B 模型。\n- **继承大模型能力**：剪枝后的 Sheared-LLaMA 完美继承了母版模型的知识底蕴，在法律术语理解和逻辑推理上表现卓越。\n- **单机即可启动**：大幅降低的算力需求使得团队利用少量消费级显卡或小型云实例即可完成微调与部署。\n- **快速验证落地**：将原本数月的研发周期压缩至数天，团队能迅速根据反馈调整策略，实现产品的敏捷迭代。\n\nLLM-Shearing 通过“剪枝 + 继续预训练”的创新路径，让中小团队也能以极低成本拥有媲美大厂的大模型衍生能力。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fprinceton-nlp_LLM-Shearing_22ed03b2.jpg","princeton-nlp","Princeton Natural Language Processing","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fprinceton-nlp_9459cd72.png","",null,"http:\u002F\u002Fnlp.cs.princeton.edu","https:\u002F\u002Fgithub.com\u002Fprinceton-nlp",[80,84,88],{"name":81,"color":82,"percentage":83},"Python","#3572A5",91.4,{"name":85,"color":86,"percentage":87},"Shell","#89e051",7.5,{"name":89,"color":90,"percentage":91},"Jupyter Notebook","#DA5B0B",1.2,643,58,"2026-04-03T09:27:38","MIT",4,"Linux","必需 NVIDIA GPU，需安装 CUDA 11.8 (cu118)，依赖 Flash Attention 1.0.3.post (不支持 v2)","未说明",{"notes":101,"python":99,"dependencies":102},"该项目基于 MosaicML Composer 构建。目前仅正式支持 LLaMA\u002FLLaMA2 模型（虽可适配 Mistral-7B 但需手动修改）。Flash Attention 2 版本暂不支持，若使用需手动修改模型文件。安装时需指定 PyTorch 的 cu118 
版本。",[103,104,105,106,107,108],"torch==2.0.1+cu118","torchvision==0.15.2+cu118","torchaudio==2.0.2","flash-attn==1.0.3.post","mosaicml-composer","mosaicml-streaming",[14,35],[111,112,113,114,115,116,117],"efficiency","llm","nlp","pre-training","pruning","llama","llama2","2026-03-27T02:49:30.150509","2026-04-12T07:53:29.499096",[121,126,131,136,141],{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},30309,"运行剪枝脚本时训练卡在 'Starting Training' 之后，GPU 利用率低且无报错，如何解决？","这是一个已知问题，通常是因为训练卡在 'spinning the dataloader' 阶段。这可能是 llm-foundry 的一个潜在 bug。解决方案是清理共享内存（shared memory）。你可以尝试运行命令 `streaming_clean_shared_memory` 或者手动删除 `\u002Fdev\u002Fshm` 下的相关文件来释放资源，然后重新运行脚本。","https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing\u002Fissues\u002F53",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},30310,"训练日志中 LanguageCrossEntropy 和 Perplexity 显示为 nan，且计数为 0，是什么原因？","这通常是因为数据集文件夹名称与脚本中配置的 `set_names` 不匹配导致的。请检查你的 MDS 数据文件中每个数据点的 `set` 条目，并确保所有文件夹的名称与你在脚本参数中传递的 `set_names` 完全一致。对齐文件夹名称后通常可以解决该问题。","https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing\u002Fissues\u002F9",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},30311,"遇到 'AssertionError: Currently only supports dynamic loading from each domain for once' 错误怎么办？","该错误表明当前版本仅支持从每个域动态加载数据一次（即 epoch 必须为 0）。如果你试图进行多轮训练或删除断言后遇到 IndexError，说明数据流已耗尽。目前建议确保配置中只进行单轮动态加载，或者检查是否使用了官方提供的 Google Drive 共享数据集，因为样例数据集可能存在配置差异。","https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing\u002Fissues\u002F15",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},30312,"为了获得最佳性能，应该先剪枝基座模型再微调，还是先微调再剪枝？","通常认为语言模型的能力主要来自于预训练，因此最合理且整洁的执行方式是先对基座模型（base model）进行剪枝，然后再在剪枝后的模型上进行微调。虽然混合预训练和微调数据进行剪枝可能有助于找到更符合指令的子模型，但其具体性能表现尚不确定，优先推荐“先剪枝后微调”的流程。","https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing\u002Fissues\u002F22",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},30313,"continue_pretrain.sh 脚本中定义的路径变量未被使用，模型参数应该在哪里传递？","这是一个脚本中的遗留问题，定义的路径确实未被直接使用。用户被鼓励通过提交 Pull Request (PR) 来修复此问题。在修复前，请检查相关 PR（如 #30）中提到的更改，或直接在命令行参数中显式传递模型路径参数，而不是依赖脚本中未使用的变量。","https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing\u002Fissues\u002F24",[]]