[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-horseee--Awesome-Efficient-LLM":3,"tool-horseee--Awesome-Efficient-LLM":61},[4,18,26,36,44,52],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",141543,2,"2026-04-06T11:32:54",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107888,"2026-04-06T11:32:50",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":10,"last_commit_at":50,"category_tags":51,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 
API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":53,"name":54,"github_repo":55,"description_zh":56,"stars":57,"difficulty_score":10,"last_commit_at":58,"category_tags":59,"status":17},4292,"Deep-Live-Cam","hacksider\u002FDeep-Live-Cam","Deep-Live-Cam 是一款专注于实时换脸与视频生成的开源工具，用户仅需一张静态照片，即可通过“一键操作”实现摄像头画面的即时变脸或制作深度伪造视频。它有效解决了传统换脸技术流程繁琐、对硬件配置要求极高以及难以实时预览的痛点，让高质量的数字内容创作变得触手可及。\n\n这款工具不仅适合开发者和技术研究人员探索算法边界，更因其极简的操作逻辑（仅需三步：选脸、选摄像头、启动），广泛适用于普通用户、内容创作者、设计师及直播主播。无论是为了动画角色定制、服装展示模特替换，还是制作趣味短视频和直播互动，Deep-Live-Cam 都能提供流畅的支持。\n\n其核心技术亮点在于强大的实时处理能力，支持口型遮罩（Mouth Mask）以保留使用者原始的嘴部动作，确保表情自然精准；同时具备“人脸映射”功能，可同时对画面中的多个主体应用不同面孔。此外，项目内置了严格的内容安全过滤机制，自动拦截涉及裸露、暴力等不当素材，并倡导用户在获得授权及明确标注的前提下合规使用，体现了技术发展与伦理责任的平衡。",88924,"2026-04-06T03:28:53",[14,15,13,60],"视频",{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":77,"owner_email":78,"owner_twitter":79,"owner_website":80,"owner_url":81,"languages":82,"stars":87,"forks":88,"last_commit_at":89,"license":90,"difficulty_score":91,"env_os":92,"env_gpu":93,"env_ram":93,"env_deps":94,"category_tags":97,"github_topics":98,"view_count":32,"oss_zip_url":90,"oss_zip_packed_at":90,"status":17,"created_at":107,"updated_at":108,"faqs":109,"releases":110},4382,"horseee\u002FAwesome-Efficient-LLM","Awesome-Efficient-LLM","A curated list for Efficient Large Language Models","Awesome-Efficient-LLM 是一个专为高效大语言模型（LLM）打造的精选资源库，旨在汇集学术界与工业界在模型轻量化领域的最新成果。随着大模型规模日益庞大，其高昂的计算成本、显存占用及推理延迟成为落地应用的主要瓶颈。Awesome-Efficient-LLM 通过系统性地整理前沿论文、开源代码及技术综述，为开发者提供了一套完整的优化解决方案。\n\n该资源库覆盖了从模型训练到部署的全链路优化技术，包括网络剪枝、知识蒸馏、量化压缩、推理加速、混合专家模型（MoE）优化、KV 
缓存压缩以及低秩分解等核心方向。此外，它还特别关注硬件适配、高效微调及新兴的高效推理模型研究，并定期更新高引用的推荐论文。\n\n无论是致力于算法创新的研究人员，还是需要将大模型部署到资源受限环境的工程师，都能在这里快速找到所需的技术路径和参考实现。通过追踪这一动态更新的列表，用户可以轻松掌握如何在不显著牺牲模型性能的前提下，大幅降低算力需求，推动大模型在更多场景下的高效应用。","# Awesome-Efficient-LLM\nA curated list for **Efficient Large Language Models**\n\n## Full List\n  - [Network Pruning \u002F Sparsity](pruning.md)\n  - [Knowledge Distillation](knowledge_distillation.md)\n  - [Quantization](quantization.md)\n  - [Inference Acceleration](inference_acceleration.md)\n  - [Efficient MOE](efficient_moe.md)\n  - [Efficient Architecture of LLM](efficient_architecture_llm.md)\n  - [KV Cache Compression](kv_cache_compression.md)\n  - [Text Compression](text_compression.md)\n  - [Low-Rank Decomposition](low_rank_decomposition.md)\n  - [Hardware \u002F System \u002F Serving](hardware.md)\n  - [Efficient Fine-tuning](tuning.md)\n  - [Efficient Training](efficient_training.md)\n  - [Survey or Benchmark](survey.md)\n  - [Reasoning Model](https:\u002F\u002Fgithub.com\u002Ffscdc\u002FAwesome-Efficient-Reasoning-Models)\n\n### Please check out all the papers by selecting the sub-area you're interested in. On this main page, only papers released in the past 90 days are shown.\n\n#### 🚀 Updates\n* April 15, 2025: We have a new [curated list](https:\u002F\u002Fgithub.com\u002Ffscdc\u002FAwesome-Efficient-Reasoning-Models) for **efficient reasoning models**!\n* May 29, 2024: We've had this awesome list for a year now :smiling_face_with_three_hearts:! \n* Sep 6, 2023: Added a new subdirectory [project\u002F](project\u002F) to organize efficient LLM projects.\n* July 11, 2023: A new subdirectory [efficient_plm\u002F](efficient_plm\u002F) was created to house papers that are applicable to PLMs. \n\n#### 💮 Contributing\n\nIf you'd like to include your paper, or need to update any details such as conference information or code URLs, please feel free to submit a pull request. 
You can generate the required markdown format for each paper by filling in the information in `generate_item.py` and executing `python generate_item.py`. We warmly appreciate your contributions to this list. Alternatively, you can email me with the links to your paper and code, and I will add your paper to the list at my earliest convenience. \n\n#### :star: Recommended Paper\n\nFor each topic, we have curated a list of recommended papers that have garnered a lot of GitHub stars or citations.\n\n\n## Papers from Sep 30, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))\n\n### Quick Link \n  - [Network Pruning \u002F Sparsity](#network-pruning--sparsity)\n  - [Knowledge Distillation](#knowledge-distillation)\n  - [Quantization](#quantization)\n  - [Inference Acceleration](#inference-acceleration)\n  - [Efficient MOE](#efficient-moe)\n  - [Efficient Architecture of LLM](#efficient-architecture-of-llm)\n  - [KV Cache Compression](#kv-cache-compression)\n  - [Text Compression](#text-compression)\n  - [Low-Rank Decomposition](#low-rank-decomposition)\n  - [Hardware \u002F System \u002F Serving](#hardware--system--serving)\n  - [Efficient Fine-tuning](#efficient-fine-tuning)\n  - [Efficient Training](#efficient-training)\n  - [Survey](#survey-or-benchmark)\n\n### Network Pruning \u002F Sparsity\n| Title & Authors | Introduction | Links |\n|:--|  :----: | :---:|\n| [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIST-DASLab\u002Fsparsegpt.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fsparsegpt) [![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-ICML'23-blue)]() [![Type](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FUnstructured-C2A4A6)]() \u003Cbr> :star: [SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fsparsegpt) \u003Cbr> Elias Frantar, Dan Alistarh| \u003Cimg width=\"522\" alt=\"image\" 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_ac4a426f6bcd.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fsparsegpt) [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.00774) | [\u002F\u002F]: #Recommend\n| [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhorseee\u002FLLM-Pruner.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fhorseee\u002FLLM-Pruner) [![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-NeurIPS'23-blue)]() [![Type](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FStructural-C2A4A6)]() \u003Cbr> :star: [LLM-Pruner: On the Structural Pruning of Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.11627) \u003Cbr> Xinyin Ma, Gongfan Fang, Xinchao Wang | \u003Cimg width=\"561\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_5cf3348735fa.png\">| [Github](https:\u002F\u002Fgithub.com\u002Fhorseee\u002FLLM-Pruner) [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.11627)| [\u002F\u002F]: #Recommend\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flocuslab\u002Fwanda.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Flocuslab\u002Fwanda) [![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-ICLR'24-blue)]() [![Type](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FUnstructured-C2A4A6)]()  \u003Cbr> :star: [A Simple and Effective Pruning Approach for Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.11695) \u003Cbr> Mingjie Sun, Zhuang Liu, Anna Bair, J. 
Zico Kolter |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_521ecaeb6396.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Flocuslab\u002Fwanda) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.11695)| [\u002F\u002F]: #Recommend\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fprinceton-nlp\u002FLLM-Shearing.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing) [![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-ICLR'24-blue)]() [![Type](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FStructural-C2A4A6)]() \u003Cbr> :star: [Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06694) \u003Cbr> Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_9468b81db3a7.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06694)| [\u002F\u002F]: #Recommend\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FMaskLLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FMaskLLM) [![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-NeurIPS'24-blue)]() [![Type](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSemi_Structured-C2A4A6)]() \u003Cbr> :star: [MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.17481) \u003Cbr> Gongfan Fang, Hongxu Yin, Saurav Muralidharan, Greg Heinrich, Jeff Pool, Jan Kautz, Pavlo Molchanov, Xinchao Wang |\u003Cimg width=\"302\" alt=\"image\" 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_f80cb94f4200.gif\"> |[Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FMaskLLM) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.17481)|[\u002F\u002F]: #Recommend\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIntelLabs\u002FHardware-Aware-Automated-Machine-Learning.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FIntelLabs\u002FHardware-Aware-Automated-Machine-Learning\u002Ftree\u002Fmain\u002FMamba-Shedder) [![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-NAACL'25-blue)]() [![Type](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FStructural-C2A4A6)]() \u003Cbr>[Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.17088) \u003Cbr> Juan Pablo Munoz, Jinjie Yuan, Nilesh Jain |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_5f33e950b549.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FIntelLabs\u002FHardware-Aware-Automated-Machine-Learning\u002Ftree\u002Fmain\u002FMamba-Shedder) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.17088)|[\u002F\u002F]: #01\u002F28\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIntelLabs\u002FHardware-Aware-Automated-Machine-Learning.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FIntelLabs\u002FHardware-Aware-Automated-Machine-Learning\u002Ftree\u002Fmain\u002FMultiPruner) [![Type](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FStructural-C2A4A6)]() \u003Cbr>[MultiPruner: Balanced Structure Removal in Foundation Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.09949) \u003Cbr> Juan Pablo Munoz, Jinjie Yuan, Nilesh Jain |\u003Cimg width=\"1002\" alt=\"image\" 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_036bbbcf80a2.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FIntelLabs\u002FHardware-Aware-Automated-Machine-Learning\u002Ftree\u002Fmain\u002FMultiPruner) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.09949)|[\u002F\u002F]: #01\u002F17\n|[HashAttention: Semantic Sparsity for Faster Inference](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.14468) \u003Cbr> Aditya Desai, Shuo Yang, Alejandro Cuadron, Ana Klimovic, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_f5a1e0346d11.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.14468)|[\u002F\u002F]: #12\u002F30\n|[Adaptive Pruning for Large Language Models with Structural Importance Awareness](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15127) \u003Cbr> Haotian Zheng, Jinke Ren, Yushan Sun, Ruichen Zhang, Wenbo Zhang, Zhen Li, Dusit Niyato, Shuguang Cui, Yatong Han |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_c8ce41ddfa58.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15127)|[\u002F\u002F]: #12\u002F30\n|[SlimGPT: Layer-wise Structured Pruning for Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18110) \u003Cbr> Gui Ling, Ziyang Wang, Yuliang Yan, Qingwen Liu |\u003Cimg width=\"302\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_c6934c04aa82.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18110)|[\u002F\u002F]: #12\u002F30\n|[Less is More: Towards Green Code Large Language Models via Unified Structural Pruning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15921) \u003Cbr> Guang Yang, Yu Zhou, Xiangyu Zhang, Wei Cheng, Ke Liu, Xiang Chen, Terry Yue 
Zhuo, Taolue Chen |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_aea60dd40969.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15921)|[\u002F\u002F]: #12\u002F30\n|[Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.01380) \u003Cbr> Marco Federici, Davide Belli, Mart van Baalen, Amir Jalalirad, Andrii Skliar, Bence Major, Markus Nagel, Paul Whatmough |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_f29176d3d795.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.01380)|[\u002F\u002F]: #12\u002F09\n|[Puzzle: Distillation-Based NAS for Inference-Optimized LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.19146) \u003Cbr> Akhiad Bercovich, Tomer Ronen, Talor Abramovich, Nir Ailon, Nave Assaf, Mohammad Dabbah et al |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_7d2818aba4cd.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.19146)|[\u002F\u002F]: #12\u002F09\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyaolu-zjut\u002FNavigation-LLM-layer-pruning.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fyaolu-zjut\u002FNavigation-LLM-layer-pruning)\u003Cbr>[Reassessing Layer Pruning in LLMs: New Insights and Methods](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.15558) \u003Cbr> Yao Lu, Hao Cheng, Yujie Fang, Zeyu Wang, Jiaheng Wei, Dongwei Xu, Qi Xuan, Xiaoniu Yang, Zhaowei Zhu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_62442edfe904.jpg\"> |[Github](https:\u002F\u002Fgithub.com\u002Fyaolu-zjut\u002FNavigation-LLM-layer-pruning) \u003Cbr> 
[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.15558)|[\u002F\u002F]: #12\u002F03\n|[Layer Importance and Hallucination Analysis in Large Language Models via Enhanced Activation Variance-Sparsity](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10069) \u003Cbr> Zichen Song, Sitan Huang, Yuxin Wu, Zhongfeng Kang |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_01be2c14c9b9.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10069)|[\u002F\u002F]: #11\u002F24\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGATECH-EIC\u002FAmoebaLLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FGATECH-EIC\u002FAmoebaLLM)[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-NeurIPS'24-blue)]()\u003Cbr>[AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10606) \u003Cbr> Yonggan Fu, Zhongzhi Yu, Junwei Li, Jiayi Qian, Yongan Zhang, Xiangchi Yuan, Dachuan Shi, Roman Yakunin, Yingyan Celine Lin |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_314967159f26.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FGATECH-EIC\u002FAmoebaLLM) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10606)|[\u002F\u002F]: #11\u002F24\n|[Scaling Law for Post-training after Model Pruning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10272) \u003Cbr> Xiaodong Chen, Yuxuan Hu, Jing Zhang, Xiaokang Zhang, Cuiping Li, Hong Chen | |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10272)|[\u002F\u002F]: #11\u002F24\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhexuandeng\u002FDRPruning.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fhexuandeng\u002FDRPruning)\u003Cbr>[DRPruning: Efficient Large Language Model 
Pruning through Distributionally Robust Optimization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.14055) \u003Cbr> Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Min Zhang, Zhaopeng Tu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_935629aee5fb.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fhexuandeng\u002FDRPruning) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.14055)|[\u002F\u002F]: #11\u002F24\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthunlp\u002FSparsingLaw.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthunlp\u002FSparsingLaw)\u003Cbr>[Sparsing Law: Towards Large Language Models with Greater Activation Sparsity](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02335) \u003Cbr> Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Zhiyuan Liu, Maosong Sun |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_5cb5a1756fd0.jpg\"> |[Github](https:\u002F\u002Fgithub.com\u002Fthunlp\u002FSparsingLaw) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02335)|[\u002F\u002F]: #11\u002F18\n|[AVSS: Layer Importance Evaluation in Large Language Models via Activation Variance-Sparsity Analysis](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02117) \u003Cbr> Zichen Song, Yuxin Wu, Sitan Huang, Zhongfeng Kang |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_d08979eccf5f.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02117)|[\u002F\u002F]: #11\u002F18\n|[Tailored-LLaMA: Optimizing Few-Shot Learning in Pruned LLaMA Models with Task-Specific Prompts](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19185) \u003Cbr> Danyal Aftab, Steven Davy |\u003Cimg width=\"1002\" alt=\"image\" 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_77fb0596e5e8.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19185)|[\u002F\u002F]: #11\u002F18\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAboveParadise\u002FLLMCBench.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAboveParadise\u002FLLMCBench)\u003Cbr>[LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.21352) \u003Cbr> Ge Yang, Changyi He, Jinyang Guo, Jianyu Wu, Yifu Ding, Aishan Liu, Haotong Qin, Pengliang Ji, Xianglong Liu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_deaae41a402e.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FAboveParadise\u002FLLMCBench) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.21352)|[\u002F\u002F]: #11\u002F17\n|[Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16135) \u003Cbr> Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, Jianfei Chen |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_73be5b39d1de.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16135)|[\u002F\u002F]: #10\u002F30\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIST-DASLab\u002FEvoPress.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002FEvoPress)\u003Cbr>[EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.14649) \u003Cbr> Oliver Sieberling, Denis Kuznedelev, Eldar Kurtic, Dan Alistarh |\u003Cimg width=\"1002\" alt=\"image\" 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_59ae0da88dd3.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002FEvoPress) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.14649)|[\u002F\u002F]: #10\u002F30\n|[FedSpaLLM: Federated Pruning of Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.14852) \u003Cbr> Guangji Bai, Yijiang Li, Zilinghan Li, Liang Zhao, Kibaek Kim |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_91ef9eeedf69.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.14852)|[\u002F\u002F]: #10\u002F30\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpiuzha\u002FAPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fpiuzha\u002FAPT)\u003Cbr>[Pruning Foundation Models for High Accuracy without Retraining](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15567) \u003Cbr> Pu Zhao, Fei Sun, Xuan Shen, Pinrui Yu, Zhenglun Kong, Yanzhi Wang, Xue Lin | |[Github](https:\u002F\u002Fgithub.com\u002Fpiuzha\u002FAPT) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15567)|[\u002F\u002F]: #10\u002F30\n|[Self-calibration for Language Model Quantization and Pruning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.17170) \u003Cbr> Miles Williams, George Chrysostomou, Nikolaos Aletras |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_cff521848222.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.17170)|[\u002F\u002F]: #10\u002F29\n|[Beware of Calibration Data for Pruning Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.17711) \u003Cbr> Yixin Ji, Yang Xiang, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang | |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.17711)|[\u002F\u002F]: 
#10\u002F29\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhaiquanlu\u002FAlphaPruning.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fhaiquanlu\u002FAlphaPruning)[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-NeurIPS'24-blue)]()\u003Cbr>[AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10912) \u003Cbr> Haiquan Lu, Yefan Zhou, Shiwei Liu, Zhangyang Wang, Michael W. Mahoney, Yaoqing Yang |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_7ca13d0304b2.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fhaiquanlu\u002FAlphaPruning) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10912)|[\u002F\u002F]: #10\u002F21\n|[Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.11261) \u003Cbr> Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, Yufa Zhou |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_4a68705f5aff.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.11261)|[\u002F\u002F]: #10\u002F21\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZhengaoLi\u002FDISP-LLM-Dimension-Independent-Structural-Pruning.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FZhengaoLi\u002FDISP-LLM-Dimension-Independent-Structural-Pruning)[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-NeurIPS'24-blue)]()\u003Cbr>[DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.11988) \u003Cbr> Shangqian Gao, Chi-Heng Lin, Ting Hua, Tang Zheng, Yilin Shen, Hongxia Jin, Yen-Chang Hsu |\u003Cimg width=\"1002\" alt=\"image\" 
src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_086bdae2d26c.png"> |[Github](https://github.com/ZhengaoLi/DISP-LLM-Dimension-Independent-Structural-Pruning) <br> [Paper](https://arxiv.org/abs/2410.11988)|[//]: #10/21
|[![Publish](https://img.shields.io/badge/Conference-NeurIPS'24%20Workshop-blue)]()<br>[Self-Data Distillation for Recovering Quality in Pruned Large Language Models](https://arxiv.org/abs/2410.09982) <br> Vithursan Thangarasa, Ganesh Venkatesh, Nish Sinnadurai, Sean Lie |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_063d03fee2a3.png"> |[Paper](https://arxiv.org/abs/2410.09982)|[//]: #10/21
|[LLM-Rank: A Graph Theoretical Approach to Pruning Large Language Models](https://arxiv.org/abs/2410.13299) <br> David Hoffmann, Kailash Budhathoki, Matthaeus Kleindessner |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_023d4de5db07.png"> |[Paper](https://arxiv.org/abs/2410.13299)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/abx393/llm-pruning-calibration-data.svg?style=social&label=Star)](https://github.com/abx393/llm-pruning-calibration-data)[![Publish](https://img.shields.io/badge/Conference-EMNLP'24-blue)]()<br>[Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning](https://arxiv.org/abs/2410.07461) <br> Abhinav Bandari, Lu Yin, Cheng-Yu Hsieh, Ajay Kumar Jaiswal, Tianlong Chen, Li Shen, Ranjay Krishna, Shiwei Liu |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_feb79788994f.png"> |[Github](https://github.com/abx393/llm-pruning-calibration-data) <br> [Paper](https://arxiv.org/abs/2410.07461)|[//]: #10/13
|[Mitigating Copy Bias in In-Context Learning through Neuron Pruning](https://arxiv.org/abs/2410.01288) <br> Ameen Ali, Lior Wolf, Ivan Titov |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_b2b4e907d64b.png"> |[Paper](https://arxiv.org/abs/2410.01288)|[//]: #10/04
|[![Star](https://img.shields.io/github/stars/IntelLabs/Hardware-Aware-Automated-Machine-Learning.svg?style=social&label=Star)](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/SQFT)[![Publish](https://img.shields.io/badge/Conference-EMNLP'24%20Findings-blue)]() [![Type](https://img.shields.io/badge/Unstructured-C2A4A6)]() [![Type](https://img.shields.io/badge/w/Quantization-39B0A9)]() <br>[SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models](https://arxiv.org/abs/2410.03750) <br> Juan Pablo Munoz, Jinjie Yuan, Nilesh Jain |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_e3cafc046795.png"> |[Github](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/SQFT) <br> [Paper](https://arxiv.org/abs/2410.03750)|[//]: #10/01
|[![Star](https://img.shields.io/github/stars/PiotrNawrot/sparse-frontier.svg?style=social&label=Star)](https://github.com/PiotrNawrot/sparse-frontier)<br>[The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs](https://arxiv.org/abs/2504.17768) <br> Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, Edoardo M. Ponti |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_cccb1663ac9d.png"> |[Github](https://github.com/PiotrNawrot/sparse-frontier) <br> [Paper](https://arxiv.org/abs/2504.17768)|[//]: #05/05
|[![Star](https://img.shields.io/github/stars/woominsong/Simba.svg?style=social&label=Star)](https://github.com/woominsong/Simba)[![Publish](https://img.shields.io/badge/Journal-TMLR_2025-blue)]()<br>[Sparsified State-Space Models are Efficient Highway Networks](https://arxiv.org/abs/2505.20698) <br> Woomin Song, Jihoon Tack, Sangwoo Mo, Seunghyuk Oh, Jinwoo Shin |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_f6154ee98427.png"> |[Github](https://github.com/woominsong/Simba) <br> [Paper](https://arxiv.org/abs/2505.20698)|[//]: #06/03

### Knowledge Distillation
| Title & Authors | Introduction | Links |
|:--|  :----: | :---:|
|:star: [Knowledge Distillation of Large Language Models](https://arxiv.org/abs/2306.08543) <br> Yuxian Gu, Li Dong, Furu Wei, Minlie Huang |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_0844e23723ab.png"> |[Github](https://github.com/microsoft/LMOps/tree/main/minillm) <br> [Paper](https://arxiv.org/abs/2306.08543)| [//]: #Recommend
|[![Publish](https://img.shields.io/badge/Conference-COLING'25-blue)]()<br>[Self-Evolution Knowledge Distillation for LLM-based Machine Translation](https://arxiv.org/abs/2412.15303) <br> Yuncheng Song, Liang Ding, Changtong Zan, Shujian Huang |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_72efe80e6db6.png"> |[Paper](https://arxiv.org/abs/2412.15303)|[//]: #12/30
|[Large Language Models Compression via Low-Rank Feature Distillation](https://arxiv.org/abs/2412.16719) <br> Yaya Sy, Christophe Cerisara, Irina Illina |<img width="302" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_1eb14047844d.png"> |[Paper](https://arxiv.org/abs/2412.16719)|[//]: #12/30
|[![Star](https://img.shields.io/github/stars/HITSZ-HLT/FSA-Distillation.svg?style=social&label=Star)](https://github.com/HITSZ-HLT/FSA-Distillation)<br>[Distilling Fine-grained Sentiment Understanding from Large Language Models](https://arxiv.org/abs/2412.18552) <br> Yice Zhang, Guangyu Xie, Hongling Xu, Kaiheng Hou, Jianzhu Bao, Qianlong Wang, Shiwei Chen, Ruifeng Xu |<img width="302" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_67b22951fef1.png"> |[Github](https://github.com/HITSZ-HLT/FSA-Distillation) <br> [Paper](https://arxiv.org/abs/2412.18552)|[//]: #12/30
|[![Star](https://img.shields.io/github/stars/alonso130r/knowledge-distillation.svg?style=social&label=Star)](https://github.com/alonso130r/knowledge-distillation)<br>[Enhancing Knowledge Distillation for LLMs with Response-Priming Prompting](https://arxiv.org/abs/2412.17846) <br> Vijay Goyal, Mustafa Khan, Aprameya Tirupati, Harveer Saini, Michael Lam, Kevin Zhu |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_54c57e4de940.png"> |[Github](https://github.com/alonso130r/knowledge-distillation) <br> [Paper](https://arxiv.org/abs/2412.17846)|[//]: #12/30
|[Improving Mathematical Reasoning Capabilities of Small Language Models via Feedback-Driven Distillation](https://arxiv.org/abs/2411.14698) <br> Xunyu Zhu, Jian Li, Can Ma, Weiping Wang |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_d68f45bf243b.png"> |[Paper](https://arxiv.org/abs/2411.14698)|[//]: #12/03
|[![Star](https://img.shields.io/github/stars/kaistai/GenPI.svg?style=social&label=Star)](https://github.com/kaistai/GenPI)<br>[Generative Prompt Internalization](https://arxiv.org/abs/2411.15927) <br> Haebin Shin, Lei Ji, Yeyun Gong, Sungdong Kim, Eunbi Choi, Minjoon Seo |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_68b9f4975e3d.png"> |[Github](https://github.com/kaistai/GenPI) <br> [Paper](https://arxiv.org/abs/2411.15927)|[//]: #12/02
|[SWITCH: Studying with Teacher for Knowledge Distillation of Large Language Models](https://arxiv.org/abs/2410.19503) <br> Jahyun Koo, Yerin Hwang, Yongil Kim, Taegwan Kang, Hyunkyung Bae, Kyomin Jung |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_480a1ba262b2.png"> |[Paper](https://arxiv.org/abs/2410.19503)|[//]: #11/17
|[![Star](https://img.shields.io/github/stars/jdeschena/sdtt.svg?style=social&label=Star)](https://github.com/jdeschena/sdtt)<br>[Beyond Autoregression: Fast LLMs via Self-Distillation Through Time](https://arxiv.org/abs/2410.21035) <br> Justin Deschenaux, Caglar Gulcehre |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_a415fbf37ac6.png"> |[Github](https://github.com/jdeschena/sdtt) <br> [Paper](https://arxiv.org/abs/2410.21035)|[//]: #11/17
|[Pre-training Distillation for Large Language Models: A Design Space Exploration](https://arxiv.org/abs/2410.16215) <br> Hao Peng, Xin Lv, Yushi Bai, Zijun Yao, Jiajie Zhang, Lei Hou, Juanzi Li | |[Paper](https://arxiv.org/abs/2410.16215)|[//]: #10/30
|[![Star](https://img.shields.io/github/stars/thu-coai/MiniPLM.svg?style=social&label=Star)](https://github.com/thu-coai/MiniPLM)<br>[MiniPLM: Knowledge Distillation for Pre-Training Language Models](https://arxiv.org/abs/2410.17215) <br> Yuxian Gu, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_03be076a5a1b.png"> |[Github](https://github.com/thu-coai/MiniPLM) <br> [Paper](https://arxiv.org/abs/2410.17215)|[//]: #10/29
|[Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling](https://arxiv.org/abs/2410.11325) <br> Wenda Xu, Rujun Han, Zifeng Wang, Long T. Le, Dhruv Madeka, Lei Li, William Yang Wang, Rishabh Agarwal, Chen-Yu Lee, Tomas Pfister |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_93add7a80da2.png"> |[Paper](https://arxiv.org/abs/2410.11325)|[//]: #10/21
|[Evolutionary Contrastive Distillation for Language Model Alignment](https://arxiv.org/abs/2410.07513) <br> Julian Katz-Samuels, Zheng Li, Hyokun Yun, Priyanka Nigam, Yi Xu, Vaclav Petricek, Bing Yin, Trishul Chilimbi |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_40e7a306f69f.png"> |[Paper](https://arxiv.org/abs/2410.07513)|[//]: #10/13

### Quantization
| Title & Authors | Introduction | Links |
|:--|  :----: | :---:|
|[![Star](https://img.shields.io/github/stars/IST-DASLab/gptq.svg?style=social&label=Star)](https://github.com/IST-DASLab/gptq)[![Publish](https://img.shields.io/badge/Conference-ICLR'23-blue)]()<br> :star: [GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323) <br> Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh |<img width="202" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_d2d96852491a.png"> |[Github](https://github.com/IST-DASLab/gptq) <br> [Paper](https://arxiv.org/abs/2210.17323)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/mit-han-lab/smoothquant.svg?style=social&label=Star)](https://github.com/mit-han-lab/smoothquant)[![Publish](https://img.shields.io/badge/Conference-ICML'23-blue)]() <br> :star: [SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models](https://arxiv.org/abs/2211.10438) <br> Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_485158e0b995.png"> |[Github](https://github.com/mit-han-lab/smoothquant) <br> [Paper](https://arxiv.org/abs/2211.10438)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/mit-han-lab/llm-awq.svg?style=social&label=Star)](https://github.com/mit-han-lab/llm-awq) <br> :star: [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/abs/2306.00978) <br> Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Song Han |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_c31399b68c00.png"> |[Github](https://github.com/mit-han-lab/llm-awq) <br> [Paper](https://arxiv.org/abs/2306.00978)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/OpenGVLab/OmniQuant.svg?style=social&label=Star)](https://github.com/OpenGVLab/OmniQuant)[![Publish](https://img.shields.io/badge/Conference-ICLR'24-blue)]()<br> :star: [OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models](https://arxiv.org/abs/2308.13137) <br> Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_8f42e8cf1b51.png"> |[Github](https://github.com/OpenGVLab/OmniQuant) <br> [Paper](https://arxiv.org/abs/2308.13137)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/utkarsh-dmx/project-resq.svg?style=social&label=Star)](https://github.com/utkarsh-dmx/project-resq)<br>[ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals](https://arxiv.org/abs/2412.14363) <br> Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, Xin Wang |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_133c5ad7f5b8.png"> |[Github](https://github.com/utkarsh-dmx/project-resq) <br> [Paper](https://arxiv.org/abs/2412.14363)|[//]: #12/30
|[MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design](https://arxiv.org/abs/2412.14590) <br> Zhen Zheng, Xiaonan Song, Chuanjie Liu |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_406ccb425ac7.png"> |[Paper](https://arxiv.org/abs/2412.14590)|[//]: #12/30
|[GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference](https://arxiv.org/abs/2412.17560) <br> Chao Zeng, Songwei Liu, Shu Yang, Fangmin Chen, Xing Mei, Lean Fu |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_a8d5bc62471b.png"> |[Paper](https://arxiv.org/abs/2412.17560)|[//]: #12/30
|[LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment](https://arxiv.org/abs/2412.18135) <br> Binrui Zeng, Bin Ji, Xiaodong Liu, Jie Yu, Shasha Li, Jun Ma, Xiaopeng Li, Shangwen Wang, Xinran Hong |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_af005f980304.png"> |[Paper](https://arxiv.org/abs/2412.18135)|[//]: #12/30
|[SKIM: Any-bit Quantization Pushing The Limits of Post-Training Quantization](https://arxiv.org/abs/2412.04180) <br> Runsheng Bai, Qiang Liu, Bo Liu |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_54b951e095f8.png"> |[Paper](https://arxiv.org/abs/2412.04180)|[//]: #12/09
|[CPTQuant -- A Novel Mixed Precision Post-Training Quantization Techniques for Large Language Models](https://arxiv.org/abs/2412.03599) <br> Amitash Nanda, Sree Bhargavi Balija, Debashis Sahoo |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_c0ea6a3e486f.png"> |[Paper](https://arxiv.org/abs/2412.03599)|[//]: #12/09
|[![Publish](https://img.shields.io/badge/Conference-HPCA'25-blue)]()<br>[Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format](https://arxiv.org/abs/2411.15982) <br> Chao Fang, Man Shi, Robin Geens, Arne Symons, Zhongfeng Wang, Marian Verhelst |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_331910be1f6e.png"> |[Paper](https://arxiv.org/abs/2411.15982)|[//]: #12/03
|[MixPE: Quantization and Hardware Co-design for Efficient LLM Inference](https://arxiv.org/abs/2411.16158) <br> Yu Zhang, Mingzi Wang, Lancheng Zou, Wulong Liu, Hui-Ling Zhen, Mingxuan Yuan, Bei Yu |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_ce747b7ba5c9.png"> |[Paper](https://arxiv.org/abs/2411.16158)|[//]: #12/03
|[![Star](https://img.shields.io/github/stars/abdelfattah-lab/BitMoD-HPCA-25.svg?style=social&label=Star)](https://github.com/abdelfattah-lab/BitMoD-HPCA-25)[![Publish](https://img.shields.io/badge/Conference-HPCA'25-blue)]()<br>[BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration](https://arxiv.org/abs/2411.11745) <br> Yuzong Chen, Ahmed F. AbouElhamayed, Xilai Dai, Yang Wang, Marta Andronic, George A. Constantinides, Mohamed S. Abdelfattah |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_10b7c5d90bb9.png"> |[Github](https://github.com/abdelfattah-lab/BitMoD-HPCA-25) <br> [Paper](https://arxiv.org/abs/2411.11745)|[//]: #11/24
|[AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference](https://arxiv.org/abs/2411.09909) <br> Janghwan Lee, Jiwoong Park, Jinseok Kim, Yongjik Kim, Jungju Oh, Jinwook Oh, Jungwook Choi |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_0668b4d10c45.png"> |[Paper](https://arxiv.org/abs/2411.09909)|[//]: #11/24
|[Bi-Mamba: Towards Accurate 1-Bit State Space Models](https://arxiv.org/abs/2411.11843) <br> Shengkun Tang, Liqun Ma, Haonan Li, Mingjie Sun, Zhiqiang Shen |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_0dfc5081f9d2.png"> |[Paper](https://arxiv.org/abs/2411.11843)|[//]: #11/24
|["Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization](https://arxiv.org/abs/2411.02355) <br> Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh | |[Paper](https://arxiv.org/abs/2411.02355)|[//]: #11/18
|[GWQ: Gradient-Aware Weight Quantization for Large Language Models](https://arxiv.org/abs/2411.00850) <br> Yihua Shao, Siyu Liang, Xiaolin Lin, Zijian Ling, Zixian Zhu et al. |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_e0ea9d819f2f.png"> |[Paper](https://arxiv.org/abs/2411.00850)|[//]: #11/18
|[A Comprehensive Study on Quantization Techniques for Large Language Models](https://arxiv.org/abs/2411.02530) <br> Jiedong Lang, Zhehao Guo, Shuyu Huang | |[Paper](https://arxiv.org/abs/2411.02530)|[//]: #11/18
|[BitNet a4.8: 4-bit Activations for 1-bit LLMs](https://arxiv.org/abs/2411.04965) <br> Hongyu Wang, Shuming Ma, Furu Wei |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_b66d18e35824.png"> |[Paper](https://arxiv.org/abs/2411.04965)|[//]: #11/18
|[![Star](https://img.shields.io/github/stars/Intelligent-Computing-Lab-Yale/TesseraQ.svg?style=social&label=Star)](https://github.com/Intelligent-Computing-Lab-Yale/TesseraQ)<br>[TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction](https://arxiv.org/abs/2410.19103) <br> Yuhang Li, Priyadarshini Panda |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_05495a561b8a.png"> |[Github](https://github.com/Intelligent-Computing-Lab-Yale/TesseraQ) <br> [Paper](https://arxiv.org/abs/2410.19103)|[//]: #11/17
|[![Star](https://img.shields.io/github/stars/xinghaow99/BitStack.svg?style=social&label=Star)](https://github.com/xinghaow99/BitStack)<br>[BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments](https://arxiv.org/abs/2410.23918) <br> Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu |<img width="1002" alt="image" src="https://github.com/xinghaow99/BitStack/raw/main/assets/bitstack.png"> |[Github](https://github.com/xinghaow99/BitStack) <br> [Paper](https://arxiv.org/abs/2410.23918)|[//]: #11/17
|[The Impact of Inference Acceleration Strategies on Bias of LLMs](https://arxiv.org/abs/2410.22118) <br> Elisabeth Kirsten, Ivan Habernal, Vedant Nanda, Muhammad Bilal Zafar | |[Paper](https://arxiv.org/abs/2410.22118)|[//]: #11/17
|[Understanding the difficulty of low-precision post-training quantization of large language models](https://arxiv.org/abs/2410.14570) <br> Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan Webb, Xin Wang |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_0863bb6cf799.png"> |[Paper](https://arxiv.org/abs/2410.14570)|[//]: #10/30
|[![Star](https://img.shields.io/github/stars/microsoft/BitNet.svg?style=social&label=Star)](https://github.com/microsoft/BitNet)<br>[1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs](https://arxiv.org/abs/2410.16144) <br> Jinheng Wang, Hansong Zhou, Ting Song, Shaoguang Mao, Shuming Ma, Hongyu Wang, Yan Xia, Furu Wei |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_e24eda1bd4b2.png"> |[Github](https://github.com/microsoft/BitNet) <br> [Paper](https://arxiv.org/abs/2410.16144)|[//]: #10/30
|[QuAILoRA: Quantization-Aware Initialization for LoRA](https://arxiv.org/abs/2410.14713) <br> Neal Lawton, Aishwarya Padmakumar, Judith Gaspers, Jack FitzGerald, Anoop Kumar, Greg Ver Steeg, Aram Galstyan | |[Paper](https://arxiv.org/abs/2410.14713)|[//]: #10/30
|[Evaluating Quantized Large Language Models for Code Generation on Low-Resource Language Benchmarks](https://arxiv.org/abs/2410.14766) <br> Enkhbold Nyamsuren | |[Paper](https://arxiv.org/abs/2410.14766)|[//]: #10/30
| [![Star](https://img.shields.io/github/stars/SqueezeAILab/SqueezeLLM.svg?style=social&label=Star)](https://github.com/SqueezeAILab/SqueezeLLM) <br> :star: [SqueezeLLM: Dense-and-Sparse Quantization](https://arxiv.org/pdf/2306.07629.pdf) <br> Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer | <img width="1102" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_52458f45dcf9.png"> |[Github](https://github.com/SqueezeAILab/SqueezeLLM) <br> [Paper](https://arxiv.org/pdf/2306.07629.pdf)| [//]: #Recommend
|[Pyramid Vector Quantization for LLMs](https://arxiv.org/abs/2410.16926) <br> Tycho F. A. van der Ouderaa, Maximilian L. Croci, Agrin Hilmkil, James Hensman |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_03dd993c24b7.png"> |[Paper](https://arxiv.org/abs/2410.16926)|[//]: #10/29
|[SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators](https://arxiv.org/abs/2410.10714) <br> Rasoul Shafipour, David Harrison, Maxwell Horton, Jeffrey Marker, Houman Bedayat, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, Saman Naderiparizi |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_fddc01e959c3.png"> |[Paper](https://arxiv.org/abs/2410.10714)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/ruikangliu/FlatQuant.svg?style=social&label=Star)](https://github.com/ruikangliu/FlatQuant)<br>[FlatQuant: Flatness Matters for LLM Quantization](https://arxiv.org/abs/2410.09426) <br> Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, Jun Yao |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_49b9ed3b0545.png"> |[Github](https://github.com/ruikangliu/FlatQuant) <br> [Paper](https://arxiv.org/abs/2410.09426)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/Mohammad-Mozaffari/slim.svg?style=social&label=Star)](https://github.com/Mohammad-Mozaffari/slim)<br>[SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs](https://arxiv.org/abs/2410.09615) <br> Mohammad Mozaffari, Maryam Mehri Dehnavi |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_26f885d1beb0.png"> |[Github](https://github.com/Mohammad-Mozaffari/slim) <br> [Paper](https://arxiv.org/abs/2410.09615)|[//]: #10/21
|[Scaling laws for post-training quantized large language models](https://arxiv.org/abs/2410.12119) <br> Zifei Xu, Alexander Lan, Wanzin Yazar, Tristan Webb, Sayeh Sharify, Xin Wang |<img width="202" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_64e03596e2e6.png"> |[Paper](https://arxiv.org/abs/2410.12119)|[//]: #10/21
|[Continuous Approximations for Improving Quantization Aware Training of LLMs](https://arxiv.org/abs/2410.10849) <br> He Li, Jianhang Hong, Yuanzhuo Wu, Snehal Adbol, Zonglin Li | |[Paper](https://arxiv.org/abs/2410.10849)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/LuoYingSong/DAQ.svg?style=social&label=Star)](https://github.com/LuoYingSong/DAQ)<br>[DAQ: Density-Aware Post-Training Weight-Only Quantization For LLMs](https://arxiv.org/abs/2410.12187) <br> Yingsong Luo, Ling Chen |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_8eb5f9897357.png"> |[Github](https://github.com/LuoYingSong/DAQ) <br> [Paper](https://arxiv.org/abs/2410.12187)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/enyac-group/Quamba.svg?style=social&label=Star)](https://github.com/enyac-group/Quamba)<br>[Quamba: A Post-Training Quantization Recipe for Selective State Space Models](https://arxiv.org/abs/2410.13229) <br> Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Diana Marculescu |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_137619497ba8.png"> |[Github](https://github.com/enyac-group/Quamba) <br> [Paper](https://arxiv.org/abs/2410.13229)|[//]: #10/21
|[AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations](https://arxiv.org/abs/2410.13212) <br> Qian Tao, Wenyuan Yu, Jingren Zhou |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_9df2fb11f2dc.png"> |[Paper](https://arxiv.org/abs/2410.13212)|[//]: #10/21
|[Channel-Wise Mixed-Precision Quantization for Large Language Models](https://arxiv.org/abs/2410.13056) <br> Zihan Chen, Bike Xie, Jundong Li, Cong Shen |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_48297dd4109e.png"> |[Paper](https://arxiv.org/abs/2410.13056)|[//]: #10/21
|[Progressive Mixed-Precision Decoding for Efficient LLM Inference](https://arxiv.org/abs/2410.13461) <br> Hao Mark Chen, Fuwen Tan, Alexandros Kouris, Royson Lee, Hongxiang Fan, Stylianos I. Venieris |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_d7e7a905f040.png"> |[Paper](https://arxiv.org/abs/2410.13461)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/Anonymous1252022/EXAQ.svg?style=social&label=Star)](https://github.com/Anonymous1252022/EXAQ)<br>[EXAQ: Exponent Aware Quantization For LLMs Acceleration](https://arxiv.org/abs/2410.03185) <br> Moran Shkolnik, Maxim Fishman, Brian Chmiel, Hilla Ben-Yaacov, Ron Banner, Kfir Yehuda Levy |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_902f8f4eb704.png"> |[Github](https://github.com/Anonymous1252022/EXAQ) <br> [Paper](https://arxiv.org/abs/2410.03185)|[//]: #10/14
|[![Star](https://img.shields.io/github/stars/ChenMnZ/PrefixQuant.svg?style=social&label=Star)](https://github.com/ChenMnZ/PrefixQuant)<br>[PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs](https://arxiv.org/abs/2410.05265) <br> Mengzhao Chen, Yi Liu, Jiahao Wang, Yi Bin, Wenqi Shao, Ping Luo |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_0d1a60e04102.png"> |[Github](https://github.com/ChenMnZ/PrefixQuant) <br> [Paper](https://arxiv.org/abs/2410.05265)|[//]: #10/14
|[![Star](https://img.shields.io/github/stars/vahe1994/AQLM.svg?style=social&label=Star)](https://github.com/vahe1994/AQLM)<br> :star: [Extreme Compression of Large Language Models via Additive Quantization](https://arxiv.org/abs/2401.06118) <br> Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_f89e976728fd.png"> |[Github](https://github.com/vahe1994/AQLM) <br> [Paper](https://arxiv.org/abs/2401.06118)| [//]: #Recommend
|[Scaling Laws for Mixed quantization in Large Language Models](https://arxiv.org/abs/2410.06722) <br> Zeyu Cao, Cheng Zhang, Pedro Gimenes, Jianqiao Lu, Jianyi Cheng, Yiren Zhao |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_2c88202f3ad1.png"> |[Paper](https://arxiv.org/abs/2410.06722)|[//]: #10/14
|[PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms](https://arxiv.org/abs/2410.05315) <br> Yilong Li, Jingyu Liu, Hao Zhang, M Badri Narayanan, Utkarsh Sharma, Shuai Zhang, Pan Hu, Yijing Zeng, Jayaram Raghuram, Suman Banerjee |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_e535ea49bac1.png"> |[Paper](https://arxiv.org/abs/2410.05315)|[//]: #10/14
|[CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression](https://arxiv.org/abs/2410.07505) <br> Wenyuan Liu, Xindian Ma, Peng Zhang, Yan Wang |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_523afd542b61.png"> |[Paper](https://arxiv.org/abs/2410.07505)|[//]: #10/13
|[SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration](https://arxiv.org/abs/2410.02367) <br> Jintao Zhang, Jia Wei, Pengle Zhang, Jun Zhu, Jianfei Chen |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_99cbb9f69298.png"> |[Paper](https://arxiv.org/abs/2410.02367)|[//]: #10/04
|[Addition is All You Need for Energy-efficient Language Models](https://arxiv.org/abs/2410.00907) <br> Hongyin Luo, Wei Sun |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_7bddbfc7dce8.png"> |[Paper](https://arxiv.org/abs/2410.00907)|[//]: #10/02
|[![Star](https://img.shields.io/github/stars/snu-mllab/GuidedQuant.svg?style=social&label=Star)](https://github.com/snu-mllab/GuidedQuant)[![Publish](https://img.shields.io/badge/Conference-ICML'25-blue)]()<br>[GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance](https://arxiv.org/abs/2505.07004) <br> Jinuk Kim, Marwa El Halabi, Wonpyo Park, Clemens JS Schaefer, Deokjae Lee, Yeonhong Park, Jae W.
Lee, Hyun Oh Song |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_38c086491872.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fsnu-mllab\u002FGuidedQuant) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07004)|[\u002F\u002F]: #06\u002F15\n\n\n### Inference Acceleration\n| Title & Authors | Introduction | Links |\n|:--|  :----: | :---:|\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFMInference\u002FDejaVu.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FFMInference\u002FDejaVu)[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-ICML'23%20Oral-blue)]()\u003Cbr> :star: [Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time](https:\u002F\u002Fopenreview.net\u002Fforum?id=wIPIhHd00i) \u003Cbr> Zichang Liu, Jue WANG, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen |\u003Cimg width=\"202\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_f36812be4d63.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FFMInference\u002FDejaVu) \u003Cbr> [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=wIPIhHd00i)| [\u002F\u002F]: #Recommend\n| [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fflexflow\u002FFlexFlow.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fflexflow\u002FFlexFlow\u002Ftree\u002Finference) \u003Cbr> :star: [SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.09781) \u003Cbr> Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia| \u003Cimg width=\"600\" alt=\"image\" 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_4e13a059dddc.png\">| [Github](https:\u002F\u002Fgithub.com\u002Fflexflow\u002FFlexFlow\u002Ftree\u002Finference) \u003Cbr> [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.09781) | [\u002F\u002F]: #Recommend\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmit-han-lab\u002Fstreaming-llm.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fstreaming-llm)\u003Cbr> :star: [Efficient Streaming Language Models with Attention Sinks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17453) \u003Cbr> Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_c7620847d34b.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fstreaming-llm) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17453)| [\u002F\u002F]: #Recommend\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSafeAILab\u002FEAGLE.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FSafeAILab\u002FEAGLE)\u003Cbr>:star: [EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation](https:\u002F\u002Fsites.google.com\u002Fview\u002Feagle-llm) \u003Cbr> Yuhui Li, Chao Zhang, and Hongyang Zhang |\u003Cimg width=\"302\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_64a3f8a5a382.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FSafeAILab\u002FEAGLE) \u003Cbr> [Blog](https:\u002F\u002Fsites.google.com\u002Fview\u002Feagle-llm)| [\u002F\u002F]: #Recommend\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFasterDecoding\u002FMedusa.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FMedusa)\u003Cbr> :star: [Medusa: Simple LLM Inference 
Acceleration Framework with Multiple Decoding Heads](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.10774) \u003Cbr> Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_7477eb4c9b19.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FMedusa) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.10774)| [\u002F\u002F]: #Recommend\n|[Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.00061) \u003Cbr> Zhuofan Wen, Shangtong Gui, Yang Feng |\u003Cimg width=\"302\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_8db7063714b2.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.00061)|[\u002F\u002F]: #12\u002F09\n|[PLD+: Accelerating LLM inference by leveraging Language Model Artifacts](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.01447) \u003Cbr> Shwetha Somasundaram, Anirudh Phukan, Apoorv Saxena |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_4a0c386a8e76.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.01447)|[\u002F\u002F]: #12\u002F09\n|[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-NeurIPS'24%20ENLSP-blue)]()\u003Cbr>[FastDraft: How to Train Your Draft](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.11055) \u003Cbr> Ofir Zafrir, Igor Margulis, Dorin Shteyman, Guy Boudoukh | |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.11055)|[\u002F\u002F]: #11\u002F24\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDavid-Li0406\u002FSMoA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FDavid-Li0406\u002FSMoA)\u003Cbr>[SMoA: Improving Multi-agent Large 
Language Models with Sparse Mixture-of-Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.03284) \u003Cbr> Dawei Li, Zhen Tan, Peijia Qian, Yifan Li, Kumar Satvik Chaudhary, Lijie Hu, Jiayi Shen |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_4f5e7f4a5f32.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FDavid-Li0406\u002FSMoA) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.03284)|[\u002F\u002F]: #11\u002F18\n|[The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.03786) \u003Cbr> Lawrence Stewart, Matthew Trager, Sujan Kumar Gonugondla, Stefano Soatto | |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.03786)|[\u002F\u002F]: #11\u002F18\n|[Accelerated AI Inference via Dynamic Execution Methods](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.00853) \u003Cbr> Haim Barad, Jascha Achterberg, Tien Pei Chou, Jean Yu | |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.00853)|[\u002F\u002F]: #11\u002F18\n|[SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04975) \u003Cbr> Gabriele Oliaro, Zhihao Jia, Daniel Campos, Aurick Qiao |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_4d71033189b7.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04975)|[\u002F\u002F]: #11\u002F18\n|[Dynamic Strategy Planning for Efficient Question Answering with Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.23511) \u003Cbr> Tanmay Parekh, Pradyot Prakash, Alexander Radovic, Akshay Shekher, Denis Savenkov |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_3037677729c9.png\"> 
|[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.23511)|[\u002F\u002F]: #11\u002F17\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInfini-AI-Lab\u002FMagicPIG.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FInfini-AI-Lab\u002FMagicPIG)\u003Cbr>[MagicPIG: LSH Sampling for Efficient LLM Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16179) \u003Cbr> Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_aefee4162d09.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FInfini-AI-Lab\u002FMagicPIG) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16179)|[\u002F\u002F]: #10\u002F30\n|[Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.17765) \u003Cbr> Artem Basharin, Andrei Chertkov, Ivan Oseledets |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_5613c1b4d139.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.17765)|[\u002F\u002F]: #10\u002F29\n|[Efficient Inference for Augmented Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18248) \u003Cbr> Rana Shahout, Cong Liang, Shiji Xin, Qianru Lao, Yong Cui, Minlan Yu, Michael Mitzenmacher |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_d0954155d282.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18248)|[\u002F\u002F]: 
#10\u002F29\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMatteoNulli\u002FVocabulary_pruning.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FMatteoNulli\u002FVocabulary_pruning)\u003Cbr>[Dynamic Vocabulary Pruning in Early-Exit LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18952) \u003Cbr> Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002FMatteoNulli\u002FVocabulary_pruning\u002Fraw\u002Fmain\u002Fsrc\u002Fimages\u002Ffinal_nips.svg\"> |[Github](https:\u002F\u002Fgithub.com\u002FMatteoNulli\u002FVocabulary_pruning) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18952)|[\u002F\u002F]: #10\u002F29\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwangqinsi1\u002FCoreInfer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fwangqinsi1\u002FCoreInfer)\u003Cbr>[CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18311#) \u003Cbr> Qinsi Wang, Saeed Vahidian, Hancheng Ye, Jianyang Gu, Jianyi Zhang, Yiran Chen |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_19d89833b500.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fwangqinsi1\u002FCoreInfer) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18311#)|[\u002F\u002F]: #10\u002F29\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmit-han-lab\u002Fduo-attention.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fduo-attention)\u003Cbr>[DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10819) \u003Cbr> Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao 
Fu, Song Han |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_2c701b8ab49f.jpg\"> |[Github](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fduo-attention) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10819)|[\u002F\u002F]: #10\u002F21\n|[DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.11744) \u003Cbr> Yunfan Xiong, Ruoyu Zhang, Yanzeng Li, Tianhao Wu, Lei Zou |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_05d786052eaf.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.11744)|[\u002F\u002F]: #10\u002F21\n|[QSpec: Speculative Decoding with Complementary Quantization Schemes](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.11305) \u003Cbr> Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, Chuan Wu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_1da06f958d43.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.11305)|[\u002F\u002F]: #10\u002F21\n|[TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05076) \u003Cbr> Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_b82cd2405e73.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05076)|[\u002F\u002F]: #10\u002F14\n|[ParallelSpec: Parallel Drafter for Efficient Speculative Decoding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05589) \u003Cbr> Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu |\u003Cimg width=\"1002\" alt=\"image\" 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_64b28cb53560.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05589)|[\u002F\u002F]: #10\u002F14\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhemingkx\u002FSWIFT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fhemingkx\u002FSWIFT)\u003Cbr>[SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.06916) \u003Cbr> Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_d2b7a7c3e0dd.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fhemingkx\u002FSWIFT) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.06916)|[\u002F\u002F]: #10\u002F14\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMooreThreads\u002FTurboRAG.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FMooreThreads\u002FTurboRAG)\u003Cbr>[TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.07590) \u003Cbr> Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, Yaohua Tang |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_d41beb613f61.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FMooreThreads\u002FTurboRAG) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.07590)|[\u002F\u002F]: #10\u002F13\n|[A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01485) \u003Cbr> Suyu Ge, Xihui Lin, Yunan Zhang, Jiawei Han, Hao Peng |\u003Cimg width=\"1002\" alt=\"image\" 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_13d8bec83858.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01485)|[\u002F\u002F]: #10\u002F04\n|[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-SIGMOD'25-blue)]()\u003Cbr>[Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2502.15734) \u003Cbr> Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Mahapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, Shiv Saini |\u003Cimg width=\"600\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_a7eaeaf1025d.png\"> | \u003Cbr> [Paper](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2502.15734)|[\u002F\u002F]: #02\u002F05\n|[Mamba Drafters for Speculative Decoding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.01206) \u003Cbr> Daewon Choi, Seunghyuk Oh, Saket Dingliwal, Jihoon Tack, Kyuyoung Kim, Woomin Song, Seojin Kim, Insu Han, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_e49500014f3e.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.01206)|[\u002F\u002F]: #06\u002F03\n|[Accelerated Test-Time Scaling with Model-Free Speculative Sampling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04708) \u003Cbr> Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_dde21153ee5c.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04708)|[\u002F\u002F]: #06\u002F05\n\n### Efficient MOE\n| Title & Authors | Introduction | Links |\n|:--|  :----: | 
:---:|\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvmazur\u002Fmixtral-offloading.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fdvmazur\u002Fmixtral-offloading)\u003Cbr>:star: [Fast Inference of Mixture-of-Experts Language Models with Offloading](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.17238) \u003Cbr> Artyom Eliseev, Denis Mazur |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_5bab035f2588.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fdvmazur\u002Fmixtral-offloading) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.17238)| [\u002F\u002F]: #Recommend\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fduterscmy\u002FCD-MoE.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fduterscmy\u002FCD-MoE)\u003Cbr>[Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.00069) \u003Cbr> Mingyu Cao, Gen Li, Jie Ji, Jiaqi Zhang, Xiaolong Ma, Shiwei Liu, Lu Yin |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_1295fc6b67f1.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fduterscmy\u002FCD-MoE) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.00069)|[\u002F\u002F]: #12\u002F09\n|[Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.00099) \u003Cbr> Andrii Skliar, Ties van Rozendaal, Romain Lepert, Todor Boinovski, Mart van Baalen, Markus Nagel, Paul Whatmough, Babak Ehteshami Bejnordi |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_d4fb821981e8.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.00099)|[\u002F\u002F]: 
#12\u002F09\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FEnflameTechnology\u002FDeepSpeed.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FEnflameTechnology\u002FDeepSpeed)\u003Cbr>[MoNTA: Accelerating Mixture-of-Experts Training with Network-Traffc-Aware Parallel Optimization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.00662) \u003Cbr> Jingming Guo, Yan Liu, Yu Meng, Zhiwei Tao, Banglan Liu, Gang Chen, Xiang Li |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_7d696b0f7b2c.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FEnflameTechnology\u002FDeepSpeed) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.00662)|[\u002F\u002F]: #11\u002F18\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fxiaochengsky\u002FMoEI-2.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fxiaochengsky\u002FMoEI-2)\u003Cbr>[MoE-I2: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.01016) \u003Cbr> Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Yuanlin Duan, Wenqi Jia, Miao Yin, Yu Cheng, Bo Yuan |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_7e36972d2be6.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fxiaochengsky\u002FMoEI-2) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.01016)|[\u002F\u002F]: #11\u002F18\n|[HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.01433) \u003Cbr> Peng Tang, Jiacheng Liu, Xiaofeng Hou, Yifei Pu, Jing Wang, Pheng-Ann Heng, Chao Li, Minyi Guo |\u003Cimg width=\"1002\" alt=\"image\" 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_51a7a192a919.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.01433)|[\u002F\u002F]: #11\u002F18\n|[ProMoE: Fast MoE-based LLM Serving using Proactive Caching](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.22134) \u003Cbr> Xiaoniu Song, Zihang Zhong, Rong Chen |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_b2fabf32e265.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.22134)|[\u002F\u002F]: #11\u002F17\n|[ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.17954) \u003Cbr> Xin He, Shunkang Zhang, Yuxin Wang, Haiyan Yin, Zihao Zeng, Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Ivor Tsang, Ong Yew Soon |\u003Cimg width=\"202\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_b34d16633a90.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.17954)|[\u002F\u002F]: #10\u002F29\n|[EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12247) \u003Cbr> Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai | |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12247)|[\u002F\u002F]: #10\u002F21\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAaronhuang-778\u002FMC-MoE.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAaronhuang-778\u002FMC-MoE)\u003Cbr>[MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.06270) \u003Cbr> Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, Xiaojuan Qi |\u003Cimg width=\"1002\" alt=\"image\" 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_3065b47eae2e.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FAaronhuang-778\u002FMC-MoE) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.06270)|[\u002F\u002F]: #10\u002F14\n\n\n\n### Efficient Architecture of LLM\n| Title & Authors | Introduction | Links |\n|:--|  :----: | :---:|\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002Fhymba.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Fhymba) ![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-ICLR'25-blue) \u003Cbr>[Hymba: A Hybrid-head Architecture for Small Language Models](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2411.13676) \u003Cbr> Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_0e09ecb6d280.png\"> |[Paper](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2411.13676)|\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmbzuai-oryx\u002FMobiLlama.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FMobiLlama)\u003Cbr>:star: [MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.16840) \u003Cbr> Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. 
Xing, Fahad Shahbaz Khan |\u003Cimg width=\"402\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_5097e6423155.gif\"> |[Github](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FMobiLlama) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.16840) \u003Cbr>[Model](https:\u002F\u002Fhuggingface.co\u002FMBZUAI\u002FMobiLlama-05B) | [\u002F\u002F]: #Recommend\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FXuezheMax\u002Fmegalodon.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FXuezheMax\u002Fmegalodon)\u003Cbr>:star: [Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.08801) \u003Cbr> Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_3e6d9e588670.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FXuezheMax\u002Fmegalodon) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.08801)| [\u002F\u002F]: #Recommend\n|[Taipan: Efficient and Expressive State Space Language Models with Selective Attention](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18572) \u003Cbr> Chien Van Nguyen, Huy Huu Nguyen, Thang M. Pham, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Ryan A. 
Rossi, Trung Bui, Viet Dac Lai, Franck Dernoncourt, Thien Huu Nguyen |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_286eb3c2168c.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18572)|[\u002F\u002F]: #10\u002F29\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FSeerAttention.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FSeerAttention)\u003Cbr>[SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13276) \u003Cbr> Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Hayden Kwok-Hay So, Ting Cao, Fan Yang, Mao Yang |\u003Cimg width=\"202\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_a5fffc1aae14.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FSeerAttention) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13276)|[\u002F\u002F]: #10\u002F21\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTUDa-HWAI\u002FBasis_Sharing.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FTUDa-HWAI\u002FBasis_Sharing)\u003Cbr>[Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.03765) \u003Cbr> Jingcun Wang, Yu-Guang Chen, Ing-Chao Lin, Bing Li, Grace Li Zhang |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_53778025f2c6.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FTUDa-HWAI\u002FBasis_Sharing) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.03765)|[\u002F\u002F]: #10\u002F14\n|[Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.06577) \u003Cbr> Zhihao He, Hang Yu, Zi 
Gong, Shizhan Liu, Jianguo Li, Weiyao Lin |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_f08feffc0375.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.06577)|[\u002F\u002F]: #10\u002F14\n|[Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.01215) \u003Cbr> Woomin Song, Sai Muralidhar Jayanthi, Srikanth Ronanki, Kanthashree Mysore Sathyendra, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_5f148dd3d747.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.01215)|[\u002F\u002F]: #06\u002F03\n\n\n### KV Cache Compression\n| Title & Authors | Introduction | Links |\n|:--|  :----: | :---:|\n|:star: [Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01801) \u003Cbr> Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_33749ef570b1.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01801)| [\u002F\u002F]: #Recommend\n|[ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.03213) \u003Cbr> Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, Minyi Guo |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_db78ee1df40b.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.03213)|[\u002F\u002F]: #12\u002F09\n|[Unifying KV Cache Compression for Large Language Models with LeanKV](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.03131) 
\u003Cbr> Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_5da6131bb0ec.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.03131)|[\u002F\u002F]: #12\u002F09\n|[Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.02252) \u003Cbr> Da Ma, Lu Chen, Situo Zhang, Yuxun Miao, Su Zhu, Zhi Chen, Hongshen Xu, Hanqi Li, Shuai Fan, Lei Pan, Kai Yu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_7aa5d746563e.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.02252)|[\u002F\u002F]: #12\u002F09\n|[MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18077) \u003Cbr> Akshat Sharma, Hangliang Ding, Jianping Li, Neel Dani, Minjia Zhang |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_821b2fa29a81.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18077)|[\u002F\u002F]: #12\u002F07\n|[TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02886) \u003Cbr> Wei Wu, Zhuoshi Pan, Chao Wang, Liyi Chen, Yunchu Bai, Kun Fu, Zheng Wang, Hui Xiong |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_c12bc0cf294e.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02886)|[\u002F\u002F]: 
#11\u002F18\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFYYFU\u002FHeadKV.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FFYYFU\u002FHeadKV)\u003Cbr>[Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19258) \u003Cbr> Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, Wen Xiao |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_ebaece5485e2.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FFYYFU\u002FHeadKV) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19258)|[\u002F\u002F]: #11\u002F17\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FJunqiZhao888\u002Fbuzz-llm.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FJunqiZhao888\u002Fbuzz-llm)\u003Cbr>[BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.23079) \u003Cbr> Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_b61c659616b5.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FJunqiZhao888\u002Fbuzz-llm) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.23079)|[\u002F\u002F]: #11\u002F17\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FwhyNLP\u002FLCKV.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FwhyNLP\u002FLCKV)\u003Cbr>[A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.14442) \u003Cbr> You Wu, Haoyi Wu, Kewei Tu |\u003Cimg width=\"202\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_ed7355f6f1e9.png\"> 
|[Github](https:\u002F\u002Fgithub.com\u002FwhyNLP\u002FLCKV) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.14442)|[\u002F\u002F]: #10\u002F30\n|[Lossless KV Cache Compression to 2%](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15252) \u003Cbr> Zhen Yang, J.N.Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_a57c596932fd.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15252)|[\u002F\u002F]: #10\u002F30\n|[MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.14731) \u003Cbr> Bokai Lin, Zihao Zeng, Zipeng Xiao, Siqi Kou, Tianqi Hou, Xiaofeng Gao, Hao Zhang, Zhijie Deng |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_89553282783b.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.14731)|[\u002F\u002F]: #10\u002F30\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fiankur\u002Fvqllm.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fiankur\u002Fvqllm)\u003Cbr>[Residual vector quantization for KV cache compression in large language model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15704) \u003Cbr> Ankur Kumar | |[Github](https:\u002F\u002Fgithub.com\u002Fiankur\u002Fvqllm) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15704)|[\u002F\u002F]: #10\u002F30\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyangyifei729\u002FKVSharer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fyangyifei729\u002FKVSharer)\u003Cbr>[KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18517) \u003Cbr> Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen 
|\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_49ab068a8b18.jpg\"> |[Github](https:\u002F\u002Fgithub.com\u002Fyangyifei729\u002FKVSharer) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18517)|[\u002F\u002F]: #10\u002F29\n|[LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.03111) \u003Cbr> Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_10a0053e5c81.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.03111)|[\u002F\u002F]: #10\u002F14\n|[SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.03960) \u003Cbr> Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, Yuxiong He |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_9bccefa9ed53.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.03960)|[\u002F\u002F]: #10\u002F14\n|[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-ICML'24-blue)]()\u003Cbr>[Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.09636) \u003Cbr> Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. 
Ponti |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_25361d534e4d.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.09636)|[\u002F\u002F]: #10\u002F02\n|[KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.00161) \u003Cbr> Isaac Rehg |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_c27d348a7f0e.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.00161)|[\u002F\u002F]: #10\u002F02\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFFY0\u002FAdaKV.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FFFY0\u002FAdaKV)\u003Cbr>[Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.11550) \u003Cbr> Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. 
Kevin Zhou |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_5944a12e847b.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FFFY0\u002FAdaKV) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.11550)|[\u002F\u002F]: #10\u002F13\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsnu-mllab\u002FContext-Memory.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fsnu-mllab\u002FContext-Memory) [![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-ICLR'24-blue)]() \u003Cbr>[Compressed Context Memory for Online Language Model Interaction](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.03414) \u003Cbr> Jang-Hyun Kim, Junyoung Yeom, Sangdoo Yun, Hyun Oh Song |\u003Cimg width=\"902\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_e2bd86a5a5ce.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fsnu-mllab\u002FContext-Memory) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.03414)|[\u002F\u002F]: #10\u002F13\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsnu-mllab\u002FKVzip.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fsnu-mllab\u002FKVzip)\u003Cbr>[KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23416) \u003Cbr> Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. 
Lee, Sangdoo Yun, Hyun Oh Song |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_a4683bbb6335.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fsnu-mllab\u002FKVzip) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23416)|[\u002F\u002F]: #05\u002F29\n\n\n### Text Compression\n| Title & Authors | Introduction | Links |\n|:--|  :----: | :---:|\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FLLMLingua.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLMLingua)[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-EMNLP'23-blue)]()\u003Cbr>:star: [LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.05736) \u003Cbr> Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_416e1066617b.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLMLingua) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.05736)| [\u002F\u002F]: #Recommend\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Falipay\u002FL3TC-leveraging-rwkv-for-learned-lossless-low-complexity-text-compression.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Falipay\u002FL3TC-leveraging-rwkv-for-learned-lossless-low-complexity-text-compression)\u003Cbr>[L3TC: Leveraging RWKV for Learned Lossless Low-Complexity Text Compression](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.16642) \u003Cbr> Junxuan Zhang, Zhengxue Cheng, Yan Zhao, Shihao Wang, Dajiang Zhou, Guo Lu, Li Song |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_edb6bd449a62.png\"> 
|[Github](https:\u002F\u002Fgithub.com\u002Falipay\u002FL3TC-leveraging-rwkv-for-learned-lossless-low-complexity-text-compression) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.16642)|[\u002F\u002F]: #12\u002F30\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNL2G\u002Fpromptoptme.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FNL2G\u002Fpromptoptme)\u003Cbr>[PromptOptMe: Error-Aware Prompt Compression for LLM-based MT Evaluation Metrics](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.16120) \u003Cbr> Daniil Larionov, Steffen Eger |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_a4176537b8bd.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FNL2G\u002Fpromptoptme) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.16120)|[\u002F\u002F]: #12\u002F30\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FLLMLingua.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLMLingua)\u003Cbr>:star: [LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06839) \u003Cbr> Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_75bde773ef3c.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLMLingua) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06839)| [\u002F\u002F]: #Recommend\n|[A Silver Bullet or a Compromise for Full Attention? 
A Comprehensive Study of Gist Token-based Context Compression](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17483) \u003Cbr> Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Xinting Huang, Dong Yu, Zhicheng Dou |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_f62c7f75f82f.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17483)|[\u002F\u002F]: #12\u002F30\n|[JPPO: Joint Power and Prompt Optimization for Accelerated Large Language Model Services](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18010) \u003Cbr> Feiran You, Hongyang Du, Kaibin Huang, Abbas Jamalipour |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_5f444bdfcf92.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18010)|[\u002F\u002F]: #12\u002F07\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkaistai\u002FGenPI.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fkaistai\u002FGenPI)\u003Cbr>[Generative Prompt Internalization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.15927) \u003Cbr> Haebin Shin, Lei Ji, Yeyun Gong, Sungdong Kim, Eunbi Choi, Minjoon Seo |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_68b9f4975e3d.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fkaistai\u002FGenPI) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.15927)|[\u002F\u002F]: #12\u002F02\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fnoelkelias\u002Fmultitok.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fnoelkelias\u002Fmultitok)\u003Cbr>[MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.21548) \u003Cbr> Noel Elias, Homa Esfahanizadeh, Kaan Kale, Sriram 
Vishwanath, Muriel Medard |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_c6e7649370d9.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fnoelkelias\u002Fmultitok) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.21548)|[\u002F\u002F]: #11\u002F17\n|[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-EMNLP'24%20Findings-blue)]()\u003Cbr>[Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.11786) \u003Cbr> Tsz Ting Chung, Leyang Cui, Lemao Liu, Xinting Huang, Shuming Shi, Dit-Yan Yeung |\u003Cimg width=\"202\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_f27312d71db2.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.11786)|[\u002F\u002F]: #10\u002F21\n|[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-EMNLP'24%20Findings-blue)]()\u003Cbr>[From Reading to Compressing: Exploring the Multi-document Reader for Prompt Compression](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.04139) \u003Cbr> Eunseong Choi, Sunkyung Lee, Minjin Choi, June Park, Jongwuk Lee |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_cbf1bd50fb1a.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.04139)|[\u002F\u002F]: #10\u002F14\n|[Perception Compressor: A training-free prompt compression method in long context scenarios](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.19272) \u003Cbr> Jiwei Tang, Jin Xu, Tingwei Lu, Hai Lin, Yiming Zhao, Hai-Tao Zheng |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_c12eb911caed.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.19272)|[\u002F\u002F]: 
#10\u002F02\n| [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FWorkday\u002Fcpc.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FWorkday\u002Fcpc)[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-AAAI'25-blue)]()\u003Cbr>[Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.01227) \u003Cbr> Barys Liskavets, Maxim Ushakov, Shuvendu Roy, Mark Klibanov, Ali Etemad, Shane Luke |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_8c1ada575b42.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FWorkday\u002Fcpc) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.01227)|[\u002F\u002F]: #12\u002F30\n| [Task-agnostic Prompt Compression with Context-aware Sentence Embedding and Reward-guided Task Descriptor](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13374v1) \u003Cbr> Barys Liskavets, Shuvendu Roy, Maxim Ushakov, Mark Klibanov, Ali Etemad, Shane Luke |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_89e9fe6f2989.png\"> | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13374v1)|[\u002F\u002F]: #12\u002F30\n\n### Low-Rank Decomposition\n| Title & Authors | Introduction | Links |\n|:--|  :----: | :---:|\n|[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-NeurIPS'24-blue)]()\u003Cbr>[ESPACE: Dimensionality Reduction of Activations for Model Compression](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05437) \u003Cbr> Charbel Sakr, Brucek Khailany |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_996e9b05616b.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05437)|[\u002F\u002F]: 
#10\u002F14\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fselfsupervised-ai\u002FNatural-GaLore.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fselfsupervised-ai\u002FNatural-GaLore)\u003Cbr>[Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16029) \u003Cbr> Arijit Das | |[Github](https:\u002F\u002Fgithub.com\u002Fselfsupervised-ai\u002FNatural-GaLore) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16029)|[\u002F\u002F]: #10\u002F30\n|[CompAct: Compressed Activations for Memory-Efficient LLM Training](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15352) \u003Cbr> Yara Shamshoum, Nitzan Hodos, Yuval Sieradzki, Assaf Schuster |\u003Cimg width=\"202\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_736f3e6ba671.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15352)|[\u002F\u002F]: #10\u002F30\n\n### Hardware\u002FSystem\u002FServing\n| Title & Authors | Introduction | Links |\n|:--|  :----: | :---:|\n|[KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18169) \u003Cbr> Rongxin Cheng, Yifan Peng, Yuxin Lai, Xingda Wei, Rong Chen, Haibo Chen |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_23cbed4508da.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18169)|[\u002F\u002F]: #12\u002F30\n|[FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18424) \u003Cbr> Ao Shen, Zhiyao Li, Mingyu Gao |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_a048911f1bd0.png\"> 
|[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18424)|[\u002F\u002F]: #12\u002F07\n|[CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02829) \u003Cbr> Hongpeng Jin, Yanzhao Wu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_3e104160d403.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02829)|[\u002F\u002F]: #11\u002F18\n|[Ripple: Accelerating LLM Inference on Smartphones with Correlation-Aware Neuron Management](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19274) \u003Cbr> Tuowei Wang, Ruwen Fan, Minxing Huang, Zixu Hao, Kun Li, Ting Cao, Youyou Lu, Yaoxue Zhang, Ju Ren |\u003Cimg width=\"302\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_b07073617c87.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19274)|[\u002F\u002F]: #11\u002F17\n|[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-ICCAD'24-blue)]()\u003Cbr>[ALISE: Accelerating Large Language Model Serving with Speculative Scheduling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.23537) \u003Cbr> Youpeng Zhao, Jun Wang |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_b563852ec107.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.23537)|[\u002F\u002F]: #11\u002F17\n|[EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15332) \u003Cbr> Junhao Hu, Wenrui Huang, Haoyi Wang, Weidong Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie |\u003Cimg width=\"202\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_2c564245970e.png\"> 
|[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15332)|[\u002F\u002F]: #10\u002F30\n|[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-NeurIPS'24-blue)]()\u003Cbr>[SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15526) \u003Cbr> Jinda Jia, Cong Xie, Hanlin Lu, Daoce Wang, Hao Feng, Chengming Zhang, Baixi Sun, Haibin Lin, Zhi Zhang, Xin Liu, Dingwen Tao |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_224a6f45ced6.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15526)|[\u002F\u002F]: #10\u002F30\n|[FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16663) \u003Cbr> Haoran Lin, Xianzhi Yu, Kang Zhao, Lu Hou, Zongyuan Zhan et al. |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_b9382bd22920.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16663)|[\u002F\u002F]: #10\u002F29\n|[POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18038) \u003Cbr> Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, Ashish Panwar |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_1bd4a09dba02.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18038)|[\u002F\u002F]: #10\u002F29\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLizonghang\u002FTPI-LLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FLizonghang\u002FTPI-LLM)\u003Cbr>[TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.00531) \u003Cbr> 
Zonghang Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_2238a97214ad.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FLizonghang\u002FTPI-LLM) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.00531)|[\u002F\u002F]: #10\u002F02\n\n\n\n### Efficient Fine-tuning\n| Title & Authors | Introduction | Links |\n|:--|  :----: | :---:|\n|[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-ACL'25%20Findings-blue)]()\u003Cbr>[LoRMA: Low-Rank Multiplicative Adaptation for LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.07621) \u003Cbr> Harsh Bihany, Shubham Patel, Ashutosh Modi |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_e69e728376d1.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.07621)|[\u002F\u002F]: #06\u002F16\n|[HELENE: Hessian Layer-wise Clipping and Gradient Annealing for Accelerating Fine-tuning LLM with Zeroth-order Optimization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10696) \u003Cbr> Huaqin Zhao, Jiaxi Li, Yi Pan, Shizhe Liang, Xiaofeng Yang, Wei Liu, Xiang Li, Fei Dou, Tianming Liu, Jin Lu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_bb3c4259083d.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10696)|[\u002F\u002F]: #11\u002F24\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLCS2-IIITD\u002FMonteCLoRA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FLCS2-IIITD\u002FMonteCLoRA)\u003Cbr>[Robust and Efficient Fine-tuning of LLMs with Bayesian Reparameterization of Low-Rank Adaptation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04358) \u003Cbr> Ayan Sengupta, Vaibhav Seth, Arinjay Pathak, Natraj Raman, Sriram Gopalakrishnan, Tanmoy 
Chakraborty |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_6233fbed144e.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FLCS2-IIITD\u002FMonteCLoRA) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04358)|[\u002F\u002F]: #11\u002F18\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fselfsupervised-ai\u002FNatural-GaLore.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fselfsupervised-ai\u002FNatural-GaLore)\u003Cbr>[Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16029) \u003Cbr> Arijit Das | |[Github](https:\u002F\u002Fgithub.com\u002Fselfsupervised-ai\u002FNatural-GaLore) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16029)|[\u002F\u002F]: #10\u002F30\n|[Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19694) \u003Cbr> Yifei Zhang, Hao Zhu, Aiwei Liu, Han Yu, Piotr Koniusz, Irwin King |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_88d7e76e9be1.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19694)|[\u002F\u002F]: #11\u002F18\n|[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-EMNLP'24%20Findings-blue)]()\u003Cbr>[MiLoRA: Efficient Mixture of Low-Rank Adaptation for Large Language Models Fine-tuning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18035) \u003Cbr> Jingfan Zhang, Yi Zhao, Dan Chen, Xing Tian, Huanran Zheng, Wei Zhu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_b2ce406bf6f1.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18035)|[\u002F\u002F]: 
#10/29
|[![Star](https://img.shields.io/github/stars/Kowsher/RoCoFT.svg?style=social&label=Star)](https://github.com/Kowsher/RoCoFT)<br>[RoCoFT: Efficient Finetuning of Large Language Models with Row-Column Updates](https://arxiv.org/abs/2410.10075) <br> Md Kowsher, Tara Esmaeilbeig, Chun-Nam Yu, Mojtaba Soltanalian, Niloofar Yousefi |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_45f0a6ddca9c.png"> |[Github](https://github.com/Kowsher/RoCoFT) <br> [Paper](https://arxiv.org/abs/2410.10075)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/Kaiseem/IST.svg?style=social&label=Star)](https://github.com/Kaiseem/IST)[![Publish](https://img.shields.io/badge/Conference-EMNLP'24-blue)]()<br>[Layer-wise Importance Matters: Less Memory for Better Performance in Parameter-efficient Fine-tuning of Large Language Models](https://arxiv.org/abs/2410.11772) <br> Kai Yao, Penlei Gao, Lichun Li, Yuan Zhao, Xiaofeng Wang, Wei Wang, Jianke Zhu |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_2f0b0fd5a5bd.png"> |[Github](https://github.com/Kaiseem/IST) <br> [Paper](https://arxiv.org/abs/2410.11772)|[//]: #10/21
|[![Publish](https://img.shields.io/badge/Conference-Nature%20Scientific%20Reports-blue)]()<br>[Parameter-Efficient Fine-Tuning of Large Language Models using Semantic Knowledge Tuning](https://arxiv.org/abs/2410.08598) <br> Nusrat Jahan Prottasha, Asif Mahmud, Md. Shohanur Islam Sobuj, Prakash Bhat, Md Kowsher, Niloofar Yousefi, Ozlem Ozmen Garibay |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_ebf0219fc8fe.png"> |[Paper](https://arxiv.org/abs/2410.08598)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/xvyaward/qeft.svg?style=social&label=Star)](https://github.com/xvyaward/qeft)[![Publish](https://img.shields.io/badge/Conference-EMNLP'24%20Findings-blue)]()<br>[QEFT: Quantization for Efficient Fine-Tuning of LLMs](https://arxiv.org/abs/2410.08661) <br> Changhun Lee, Jun-gyu Jin, Younghyun Cho, Eunhyeok Park |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_9e61a9585690.png"> |[Github](https://github.com/xvyaward/qeft) <br> [Paper](https://arxiv.org/abs/2410.08661)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/Aofei-Chang/BIPEFT.svg?style=social&label=Star)](https://github.com/Aofei-Chang/BIPEFT)[![Publish](https://img.shields.io/badge/Conference-EMNLP'24%20Findings-blue)]()<br>[BIPEFT: Budget-Guided Iterative Search for Parameter Efficient Fine-Tuning of Large Pretrained Language Models](https://arxiv.org/abs/2410.09079) <br> Aofei Chang, Jiaqi Wang, Han Liu, Parminder Bhatia, Cao Xiao, Ting Wang, Fenglong Ma |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_b49f88abca19.png"> |[Github](https://github.com/Aofei-Chang/BIPEFT) <br> [Paper](https://arxiv.org/abs/2410.09079)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/sayankotor/sparse_grads.svg?style=social&label=Star)](https://github.com/sayankotor/sparse_grads)<br>[SparseGrad: A Selective Method for Efficient Fine-tuning of MLP Layers](https://arxiv.org/abs/2410.07383) <br> Viktoriia Chekalina, Anna Rudenko, Gleb Mezentsev, Alexander Mikhalev, Alexander Panchenko, Ivan Oseledets |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_072c7242cd1b.png"> |[Github](https://github.com/sayankotor/sparse_grads) <br> [Paper](https://arxiv.org/abs/2410.07383)|[//]: #10/13
|[SpaLLM: Unified Compressive Adaptation of Large Language Models with Sketching](https://arxiv.org/abs/2410.06364) <br> Tianyi Zhang, Junda Su, Oscar Wu, Zhaozhuo Xu, Anshumali Shrivastava |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_465c6a247982.png"> |[Paper](https://arxiv.org/abs/2410.06364)|[//]: #10/13


### Efficient Training
| Title & Authors | Introduction | Links |
|:--|  :----: | :---:|
|[![Star](https://img.shields.io/github/stars/neiterman21/LDB.svg?style=social&label=Star)](https://github.com/neiterman21/LDB)<br>[LayerDropBack: A Universally Applicable Approach for Accelerating Training of Deep Networks](https://arxiv.org/abs/2412.18027) <br> Evgeny Hershkovitch Neiterman, Gil Ben-Artzi |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_3a004c1b9def.png"> |[Github](https://github.com/neiterman21/LDB) <br> 
[Paper](https://arxiv.org/abs/2412.18027)|[//]: #12/30
|[AutoMixQ: Self-Adjusting Quantization for High Performance Memory-Efficient Fine-Tuning](https://arxiv.org/abs/2411.13814) <br> Changhai Zhou, Shiyang Zhang, Yuhua Zhou, Zekai Liu, Shichao Weng |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_6228f3424bfd.png"> |[Paper](https://arxiv.org/abs/2411.13814)|[//]: #11/24
|[![Star](https://img.shields.io/github/stars/TsinghuaC3I/LPA.svg?style=social&label=Star)](https://github.com/TsinghuaC3I/LPA)[![Publish](https://img.shields.io/badge/Conference-EMNLP'24-blue)]()<br>[Scalable Efficient Training of Large Language Models with Low-dimensional Projected Attention](https://arxiv.org/abs/2411.02063) <br> Xingtai Lv, Ning Ding, Kaiyan Zhang, Ermo Hua, Ganqu Cui, Bowen Zhou |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_3e669dc89fb9.png"> |[Github](https://github.com/TsinghuaC3I/LPA) <br> [Paper](https://arxiv.org/abs/2411.02063)|[//]: #11/18
|[![Star](https://img.shields.io/github/stars/NVlabs/COAT.svg?style=social&label=Star)](https://github.com/NVlabs/COAT)<br>[COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training](https://arxiv.org/abs/2410.19313) <br> Haocheng Xi, Han Cai, Ligeng Zhu, Yao Lu, Kurt Keutzer, Jianfei Chen, Song Han |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_26370ac3c601.png"> |[Github](https://github.com/NVlabs/COAT) <br> [Paper](https://arxiv.org/abs/2410.19313)|[//]: #11/17
|[![Star](https://img.shields.io/github/stars/wuhouming/BitPipe.svg?style=social&label=Star)](https://github.com/wuhouming/BitPipe)<br>[BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training](https://arxiv.org/abs/2410.19367) <br> Houming Wu, Ling Chen, Wenjie Yu |<img width="1002" alt="image" src="https://github.com/wuhouming/BitPipe/raw/main/docs/BitPipe_images/BitPipe-v.svg"> |[Github](https://github.com/wuhouming/BitPipe) <br> [Paper](https://arxiv.org/abs/2410.19367)|[//]: #11/17
|[![Star](https://img.shields.io/github/stars/selfsupervised-ai/Natural-GaLore.svg?style=social&label=Star)](https://github.com/selfsupervised-ai/Natural-GaLore)<br>[Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning](https://arxiv.org/abs/2410.16029) <br> Arijit Das | |[Github](https://github.com/selfsupervised-ai/Natural-GaLore) <br> [Paper](https://arxiv.org/abs/2410.16029)|[//]: #10/30
|[CompAct: Compressed Activations for Memory-Efficient LLM Training](https://arxiv.org/abs/2410.15352) <br> Yara Shamshoum, Nitzan Hodos, Yuval Sieradzki, Assaf Schuster |<img width="202" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_736f3e6ba671.png"> |[Paper](https://arxiv.org/abs/2410.15352)|[//]: #10/30


### Survey (or Benchmark)
| Title & Authors | Introduction | Links |
|:--|  :----: | 
:---:|
|[Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding](https://arxiv.org/abs/2411.13157) <br> Hyun Ryu, Eric Kim |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_75d60a983a59.png"> |[Paper](https://arxiv.org/abs/2411.13157)|[//]: #11/24
|[![Star](https://img.shields.io/github/stars/argonne-lcf/LLM-Inference-Bench.svg?style=social&label=Star)](https://github.com/argonne-lcf/LLM-Inference-Bench)<br>[LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators](https://arxiv.org/abs/2411.00136) <br> Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus et al | |[Github](https://github.com/argonne-lcf/LLM-Inference-Bench) <br> [Paper](https://arxiv.org/abs/2411.00136)|[//]: #11/18
|[![Star](https://img.shields.io/github/stars/ZongqianLi/Prompt-Compression-Survey.svg?style=social&label=Star)](https://github.com/ZongqianLi/Prompt-Compression-Survey)<br>[Prompt Compression for Large Language Models: A Survey](https://arxiv.org/abs/2410.12388) <br> Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_f6515b4dff5d.png"> |[Github](https://github.com/ZongqianLi/Prompt-Compression-Survey) <br> [Paper](https://arxiv.org/abs/2410.12388)|[//]: #10/21
|[Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective](https://arxiv.org/abs/2410.04466) <br> Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_779b11575682.png"> |[Paper](https://arxiv.org/abs/2410.04466)|[//]: #10/14


# Awesome-Efficient-LLM
A curated list for **Efficient Large Language Models**

## Full List
  - [Network Pruning / Sparsity](pruning.md)
  - [Knowledge Distillation](knowledge_distillation.md)
  - [Quantization](quantization.md)
  - [Inference Acceleration](inference_acceleration.md)
  - [Efficient MOE](efficient_moe.md)
  - [Efficient Architecture of LLM](efficient_architecture_llm.md)
  - [KV Cache Compression](kv_cache_compression.md)
  - [Text Compression](text_compression.md)
  - [Low-Rank Decomposition](low_rank_decomposition.md)
  - [Hardware / System / Serving](hardware.md)
  - [Efficient Fine-Tuning](tuning.md)
  - [Efficient Training](efficient_training.md)
  - [Survey or Benchmark](survey.md)
  - [Efficient Reasoning Models](https://github.com/fscdc/Awesome-Efficient-Reasoning-Models)

### Please check out all the papers by selecting the sub-area you are interested in. On this main page, only papers released in the past 90 days are shown.

#### 🚀 Updates
* Apr 15, 2025: We added a new curated list for **Efficient Reasoning Models**!
* May 29, 2024: This awesome list has been online for one year :smiling_face_with_three_hearts:!
* Sep 6, 2023: Added a new subdirectory [project/](project/) to organize efficient LLM projects.
* Jul 11, 2023: Created a new subdirectory [efficient_plm/](efficient_plm/) for papers applicable to PLMs.

#### 💮 Contributing
If you would like to add your paper to the list, or to update details such as conference information or code links, feel free to submit a pull request. You can generate the required markdown for each paper by filling in the information in `generate_item.py` and running `python generate_item.py`. We greatly appreciate your contributions to this list. Alternatively, you can email me the links to your paper and code, and I will add your paper to the list as soon as possible.

#### :star: Recommended Papers
For each topic, we have selected some recommended papers that have received many GitHub stars or citations.

## Papers released from Sep 30, 2024 (the full list, starting from May 22, 2023, is [here](#full-list))

### Quick Links
  - [Network Pruning / Sparsity](#network-pruning--sparsity)
  - [Knowledge Distillation](#knowledge-distillation)
  - [Quantization](#quantization)
  - [Inference Acceleration](#inference-acceleration)
  - [Efficient MOE](#efficient_moe)
  - [Efficient Architecture of LLM](#efficient-architecture-of-llm)
  - [KV Cache Compression](#kv-cache-compression)
  - [Text Compression](#text-compression)
  - [Low-Rank Decomposition](#low-rank-decomposition)
  - [Hardware / System / Serving](#hardwaresystemserving)
  - [Efficient Fine-Tuning](#efficient-fine-tuning)
  - 
[Efficient Training](#efficient-training)
  - [Survey](#survey-or-benchmark)

### Network Pruning / Sparsity
| Title & Authors | Introduction | Links |
|:--|  :----: | :---:|
| [![Star](https://img.shields.io/github/stars/IST-DASLab/sparsegpt.svg?style=social&label=Star)](https://github.com/IST-DASLab/sparsegpt) [![Publish](https://img.shields.io/badge/Conference-ICML'23-blue)]() [![Type](https://img.shields.io/badge/Unstructured-C2A4A6)]() <br> :star: [SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot](https://github.com/IST-DASLab/sparsegpt) <br> Elias Frantar, Dan Alistarh| <img width="522" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_ac4a426f6bcd.png"> |[Github](https://github.com/IST-DASLab/sparsegpt) [paper](https://arxiv.org/abs/2301.00774) | [//]: #Recommend
| [![Star](https://img.shields.io/github/stars/horseee/LLM-Pruner.svg?style=social&label=Star)](https://github.com/horseee/LLM-Pruner) [![Publish](https://img.shields.io/badge/Conference-NeurIPS'23-blue)]() [![Type](https://img.shields.io/badge/Structural-C2A4A6)]() <br> :star: [LLM-Pruner: On the Structural Pruning of Large Language Models](https://arxiv.org/abs/2305.11627) <br> Xinyin Ma, Gongfan Fang, Xinchao Wang | <img width="561" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_5cf3348735fa.png">| [Github](https://github.com/horseee/LLM-Pruner) [paper](https://arxiv.org/abs/2305.11627)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/locuslab/wanda.svg?style=social&label=Star)](https://github.com/locuslab/wanda) [![Publish](https://img.shields.io/badge/Conference-ICLR'24-blue)]() [![Type](https://img.shields.io/badge/Unstructured-C2A4A6)]()  <br> :star: [A Simple and Effective Pruning Approach for Large Language Models](https://arxiv.org/abs/2306.11695) <br> Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_521ecaeb6396.png"> |[Github](https://github.com/locuslab/wanda) <br> [Paper](https://arxiv.org/abs/2306.11695)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/princeton-nlp/LLM-Shearing.svg?style=social&label=Star)](https://github.com/princeton-nlp/LLM-Shearing) [![Publish](https://img.shields.io/badge/Conference-ICLR'24-blue)]() [![Type](https://img.shields.io/badge/Structural-C2A4A6)]() <br> :star: [Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning](https://arxiv.org/abs/2310.06694) <br> Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_9468b81db3a7.png"> |[Github](https://github.com/princeton-nlp/LLM-Shearing) <br> [Paper](https://arxiv.org/abs/2310.06694)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/NVlabs/MaskLLM.svg?style=social&label=Star)](https://github.com/NVlabs/MaskLLM) [![Publish](https://img.shields.io/badge/Conference-NeurIPS'24-blue)]() [![Type](https://img.shields.io/badge/Semi_Structured-C2A4A6)]() <br> :star: [MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models](https://arxiv.org/abs/2409.17481) <br> Gongfan Fang, Hongxu Yin, Saurav Muralidharan, Greg Heinrich, Jeff Pool, Jan Kautz, Pavlo Molchanov, Xinchao Wang |<img width="302" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_f80cb94f4200.gif"> |[Github](https://github.com/NVlabs/MaskLLM) <br> [Paper](https://arxiv.org/abs/2409.17481)|[//]: #Recommend
|[![Star](https://img.shields.io/github/stars/IntelLabs/Hardware-Aware-Automated-Machine-Learning.svg?style=social&label=Star)](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/Mamba-Shedder) [![Publish](https://img.shields.io/badge/Conference-NAACL'25-blue)]() [![Type](https://img.shields.io/badge/Structural-C2A4A6)]() <br>[Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models](https://arxiv.org/abs/2501.17088) <br> Juan Pablo Munoz, Jinjie Yuan, Nilesh Jain |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_5f33e950b549.png"> |[Github](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/Mamba-Shedder) <br> [Paper](https://arxiv.org/abs/2501.17088)|[//]: #01/28
|[![Star](https://img.shields.io/github/stars/IntelLabs/Hardware-Aware-Automated-Machine-Learning.svg?style=social&label=Star)](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/MultiPruner) [![Type](https://img.shields.io/badge/Structural-C2A4A6)]() <br>[MultiPruner: Balanced Structure Removal in Foundation Models](https://arxiv.org/abs/2501.09949) <br> Juan Pablo Munoz, Jinjie Yuan, Nilesh Jain |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_036bbbcf80a2.png"> |[Github](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/MultiPruner) <br> [Paper](https://arxiv.org/abs/2501.09949)|[//]: #01/17
|[HashAttention: Semantic Sparsity for Faster Inference](https://arxiv.org/abs/2412.14468) <br> Aditya Desai, Shuo Yang, Alejandro Cuadron, Ana Klimovic, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_f5a1e0346d11.png"> |[Paper](https://arxiv.org/abs/2412.14468)|[//]: #12/30
|[Adaptive Pruning of Large Language Models with Structural Importance Awareness](https://arxiv.org/abs/2412.15127) <br> Haotian Zheng, Jinke Ren, Yushan Sun, Ruichen Zhang, Wenbo Zhang, Zhen Li, Dusit Niyato, Shuguang Cui, Yatong Han |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_c8ce41ddfa58.png"> |[Paper](https://arxiv.org/abs/2412.15127)|[//]: #12/30
|[SlimGPT: Layer-wise Structured Pruning for Large Language Models](https://arxiv.org/abs/2412.18110) <br> Gui Ling, Ziyang Wang, Yuliang Yan, Qingwen Liu |<img width="302" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_c6934c04aa82.png"> |[Paper](https://arxiv.org/abs/2412.18110)|[//]: #12/30
|[Less is More: Towards Green Code Large Language Models via Unified Structural Pruning](https://arxiv.org/abs/2412.15921) <br> Guang Yang, Yu Zhou, Xiangyu Zhang, Wei Cheng, Ke Liu, Xiang Chen, Terry Yue Zhuo, Taolue Chen |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_aea60dd40969.png"> 
|[Paper](https://arxiv.org/abs/2412.15921)|[//]: #12/30
|[Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking](https://arxiv.org/abs/2412.01380) <br> Marco Federici, Davide Belli, Mart van Baalen, Amir Jalalirad, Andrii Skliar, Bence Major, Markus Nagel, Paul Whatmough |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_f29176d3d795.png"> |[Paper](https://arxiv.org/abs/2412.01380)|[//]: #12/09
|[Puzzle: Distillation-Based NAS for Inference-Optimized LLMs](https://arxiv.org/abs/2411.19146) <br> Akhiad Bercovich, Tomer Ronen, Talor Abramovich, Nir Ailon, Nave Assaf, Mohammad Dabbah et al. |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_7d2818aba4cd.png"> |[Paper](https://arxiv.org/abs/2411.19146)|[//]: #12/09
|[![Star](https://img.shields.io/github/stars/yaolu-zjut/Navigation-LLM-layer-pruning.svg?style=social&label=Star)](https://github.com/yaolu-zjut/Navigation-LLM-layer-pruning)<br>[Reassessing Layer Pruning in LLMs: New Insights and Methods](https://arxiv.org/abs/2411.15558) <br> Yao Lu, Hao Cheng, Yujie Fang, Zeyu Wang, Jiaheng Wei, Dongwei Xu, Qi Xuan, Xiaoniu Yang, Zhaowei Zhu |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_62442edfe904.jpg"> |[Github](https://github.com/yaolu-zjut/Navigation-LLM-layer-pruning) <br> [Paper](https://arxiv.org/abs/2411.15558)|[//]: #12/03
|[Layer Importance and Hallucination Analysis in Large Language Models via Enhanced Activation Variance-Sparsity](https://arxiv.org/abs/2411.10069) <br> Zichen Song, Sitan Huang, Yuxin Wu, Zhongfeng Kang |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_01be2c14c9b9.png"> |[Paper](https://arxiv.org/abs/2411.10069)|[//]: #11/24
|[![Star](https://img.shields.io/github/stars/GATECH-EIC/AmoebaLLM.svg?style=social&label=Star)](https://github.com/GATECH-EIC/AmoebaLLM)[![Publish](https://img.shields.io/badge/Conference-NeurIPS'24-blue)]()<br>[AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment](https://arxiv.org/abs/2411.10606) <br> Yonggan Fu, Zhongzhi Yu, Junwei Li, Jiayi Qian, Yongan Zhang, Xiangchi Yuan, Dachuan Shi, Roman Yakunin, Yingyan Celine Lin |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_314967159f26.png"> |[Github](https://github.com/GATECH-EIC/AmoebaLLM) <br> [Paper](https://arxiv.org/abs/2411.10606)|[//]: #11/24
|[Scaling Law for Post-training after Model Pruning](https://arxiv.org/abs/2411.10272) <br> Xiaodong Chen, Yuxuan Hu, Jing Zhang, Xiaokang Zhang, Cuiping Li, Hong Chen | |[Paper](https://arxiv.org/abs/2411.10272)|[//]: #11/24
|[![Star](https://img.shields.io/github/stars/hexuandeng/DRPruning.svg?style=social&label=Star)](https://github.com/hexuandeng/DRPruning)<br>[DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization](https://arxiv.org/abs/2411.14055) <br> Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Min Zhang, Zhaopeng Tu |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_935629aee5fb.png"> |[Github](https://github.com/hexuandeng/DRPruning) <br> [Paper](https://arxiv.org/abs/2411.14055)|[//]: #11/24
|[![Star](https://img.shields.io/github/stars/thunlp/SparsingLaw.svg?style=social&label=Star)](https://github.com/thunlp/SparsingLaw)<br>[Sparsing Law: Towards Large Language Models with Greater Activation Sparsity](https://arxiv.org/abs/2411.02335) <br> Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Zhiyuan Liu, Maosong Sun |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_5cb5a1756fd0.jpg"> |[Github](https://github.com/thunlp/SparsingLaw) <br> [Paper](https://arxiv.org/abs/2411.02335)|[//]: #11/18
|[AVSS: Layer Importance Evaluation in Large Language Models via Activation Variance-Sparsity Analysis](https://arxiv.org/abs/2411.02117) <br> Zichen Song, Yuxin Wu, Sitan Huang, Zhongfeng Kang |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_d08979eccf5f.png"> |[Paper](https://arxiv.org/abs/2411.02117)|[//]: #11/18
|[Tailored-LLaMA: Optimizing Few-Shot Learning in Pruned LLaMA Models with Task-Specific Prompts](https://arxiv.org/abs/2410.19185) <br> Danyal Aftab, Steven Davy |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_77fb0596e5e8.png"> |[Paper](https://arxiv.org/abs/2410.19185)|[//]: #11/18
|[![Star](https://img.shields.io/github/stars/AboveParadise/LLMCBench.svg?style=social&label=Star)](https://github.com/AboveParadise/LLMCBench)<br>[LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment](https://arxiv.org/abs/2410.21352) <br> Ge Yang, Changyi He, Jinyang Guo, Jianyu Wu, Yifu Ding, Aishan Liu, Haotong Qin, Pengliang Ji, Xianglong Liu |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_deaae41a402e.png"> |[Github](https://github.com/AboveParadise/LLMCBench) <br> [Paper](https://arxiv.org/abs/2410.21352)|[//]: #11/17
|[Beyond 2:4: Exploring V:N:M Sparsity for Efficient Transformer Inference on GPUs](https://arxiv.org/abs/2410.16135) <br> Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, Jianfei Chen |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_73be5b39d1de.png"> |[Paper](https://arxiv.org/abs/2410.16135)|[//]: #10/30
|[![Star](https://img.shields.io/github/stars/IST-DASLab/EvoPress.svg?style=social&label=Star)](https://github.com/IST-DASLab/EvoPress)<br>[EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search](https://arxiv.org/abs/2410.14649) <br> Oliver Sieberling, Denis Kuznedelev, Eldar Kurtic, Dan Alistarh |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_59ae0da88dd3.png"> |[Github](https://github.com/IST-DASLab/EvoPress) <br> [Paper](https://arxiv.org/abs/2410.14649)|[//]: #10/30
|[FedSpaLLM: Federated Pruning of Large Language Models](https://arxiv.org/abs/2410.14852) <br> Guangji Bai, Yijiang Li, Zilinghan Li, Liang Zhao, Kibaek Kim |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_91ef9eeedf69.png"> |[Paper](https://arxiv.org/abs/2410.14852)|[//]: 
#10/30
|[![Star](https://img.shields.io/github/stars/piuzha/APT.svg?style=social&label=Star)](https://github.com/piuzha/APT)<br>[Pruning Foundation Models for High Accuracy without Retraining](https://arxiv.org/abs/2410.15567) <br> Pu Zhao, Fei Sun, Xuan Shen, Pinrui Yu, Zhenglun Kong, Yanzhi Wang, Xue Lin | |[Github](https://github.com/piuzha/APT) <br> [Paper](https://arxiv.org/abs/2410.15567)|[//]: #10/30
|[Self-calibration for Language Model Quantization and Pruning](https://arxiv.org/abs/2410.17170) <br> Miles Williams, George Chrysostomou, Nikolaos Aletras |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_cff521848222.png"> |[Paper](https://arxiv.org/abs/2410.17170)|[//]: #10/29
|[Beware of Calibration Data for Pruning Large Language Models](https://arxiv.org/abs/2410.17711) <br> Yixin Ji, Yang Xiang, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang | |[Paper](https://arxiv.org/abs/2410.17711)|[//]: #10/29
|[![Star](https://img.shields.io/github/stars/haiquanlu/AlphaPruning.svg?style=social&label=Star)](https://github.com/haiquanlu/AlphaPruning)[![Publish](https://img.shields.io/badge/Conference-NeurIPS'24-blue)]()<br>[AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models](https://arxiv.org/abs/2410.10912) <br> Haiquan Lu, Yefan Zhou, Shiwei Liu, Zhangyang Wang, Michael W. Mahoney, Yaoqing Yang |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_7ca13d0304b2.png"> |[Github](https://github.com/haiquanlu/AlphaPruning) <br> [Paper](https://arxiv.org/abs/2410.10912)|[//]: #10/21
|[Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix](https://arxiv.org/abs/2410.11261) <br> Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, Yufa Zhou |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_4a68705f5aff.png"> |[Paper](https://arxiv.org/abs/2410.11261)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/ZhengaoLi/DISP-LLM-Dimension-Independent-Structural-Pruning.svg?style=social&label=Star)](https://github.com/ZhengaoLi/DISP-LLM-Dimension-Independent-Structural-Pruning)[![Publish](https://img.shields.io/badge/Conference-NeurIPS'24-blue)]()<br>[DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models](https://arxiv.org/abs/2410.11988) <br> Shangqian Gao, Chi-Heng Lin, Ting Hua, Tang Zheng, Yilin Shen, Hongxia Jin, Yen-Chang Hsu |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_086bdae2d26c.png"> |[Github](https://github.com/ZhengaoLi/DISP-LLM-Dimension-Independent-Structural-Pruning) <br> [Paper](https://arxiv.org/abs/2410.11988)|[//]: #10/21
|[![Publish](https://img.shields.io/badge/Conference-NeurIPS'24%20Workshop-blue)]()<br>[Self-Data Distillation for Recovering Quality in Pruned Large Language Models](https://arxiv.org/abs/2410.09982) <br> Vithursan Thangarasa, Ganesh Venkatesh, Nish Sinnadurai, Sean Lie |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_063d03fee2a3.png"> |[Paper](https://arxiv.org/abs/2410.09982)|[//]: #10/21
|[LLM-Rank: A Graph Theoretical Approach to Pruning Large Language Models](https://arxiv.org/abs/2410.13299) <br> David Hoffmann, Kailash Budhathoki, Matthaeus Kleindessner |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_023d4de5db07.png"> |[Paper](https://arxiv.org/abs/2410.13299)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/abx393/llm-pruning-calibration-data.svg?style=social&label=Star)](https://github.com/abx393/llm-pruning-calibration-data)[![Publish](https://img.shields.io/badge/Conference-EMNLP'24-blue)]()<br>[Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning](https://arxiv.org/abs/2410.07461) <br> Abhinav Bandari, Lu Yin, Cheng-Yu Hsieh, Ajay Kumar Jaiswal, Tianlong Chen, Li Shen, Ranjay Krishna, Shiwei Liu |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_feb79788994f.png"> |[Github](https://github.com/abx393/llm-pruning-calibration-data) <br> [Paper](https://arxiv.org/abs/2410.07461)|[//]: #10/13
|[Mitigating Copy Bias in In-Context Learning through Neuron Pruning](https://arxiv.org/abs/2410.01288) <br> Ameen Ali, Lior Wolf, Ivan Titov |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_b2b4e907d64b.png"> |[Paper](https://arxiv.org/abs/2410.01288)|[//]: #10/04
|[![Star](https://img.shields.io/github/stars/IntelLabs/Hardware-Aware-Automated-Machine-Learning.svg?style=social&label=Star)](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/SQFT)[![Publish](https://img.shields.io/badge/Conference-EMNLP'24%20Findings-blue)]() [![Type](https://img.shields.io/badge/Unstructured-C2A4A6)]() [![Type](https://img.shields.io/badge/w/Quantization-39B0A9)]() <br>[SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models](https://arxiv.org/abs/2410.03750) <br> Juan Pablo Munoz, Jinjie Yuan, Nilesh Jain |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_e3cafc046795.png"> |[Github](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/SQFT) <br> [Paper](https://arxiv.org/abs/2410.03750)|[//]: #10/01
|[![Star](https://img.shields.io/github/stars/PiotrNawrot/sparse-frontier.svg?style=social&label=Star)](https://github.com/PiotrNawrot/sparse-frontier)<br>[The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs](https://arxiv.org/abs/2504.17768) <br> Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, Edoardo M. 
Ponti |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_cccb1663ac9d.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FPiotrNawrot\u002Fsparse-frontier) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.17768)|[\u002F\u002F]: #05\u002F05\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwoominsong\u002FSimba.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fwoominsong\u002FSimba)[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FJournal-TMLR_2025-blue)]()\u003Cbr>[稀疏化状态空间模型是高效的高速公路网络](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.20698) \u003Cbr> Woomin Song, Jihoon Tack, Sangwoo Mo, Seunghyuk Oh, Jinwoo Shin |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_f6154ee98427.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fwoominsong\u002FSimba) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.20698)|[\u002F\u002F]: #06\u002F03\n\n### 知识蒸馏\n| 标题与作者 | 简介 | 链接 |\n|:--|  :----: | :---:|\n|:star: [大型语言模型的知识蒸馏](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.08543) \u003Cbr> 顾宇贤、董力、魏福鲁、黄民烈 |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_0844e23723ab.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLMOps\u002Ftree\u002Fmain\u002Fminillm) \u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.08543)| [\u002F\u002F]: #推荐\n|[![发布](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-COLING'25-blue)]()\u003Cbr>[基于LLM的机器翻译的自我进化知识蒸馏](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15303) \u003Cbr> 宋云成、丁亮、赞昌通、黄树坚 |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_72efe80e6db6.png\"> 
|[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15303)|[\u002F\u002F]: #12\u002F30\n|[通过低秩特征蒸馏压缩大型语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.16719) \u003Cbr> 亚亚·西、克里斯托夫·塞里萨拉、伊琳娜·伊利娜 |\u003Cimg width=\"302\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_1eb14047844d.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.16719)|[\u002F\u002F]: #12\u002F30\n|[![星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHITSZ-HLT\u002FFSA-Distillation.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FHITSZ-HLT\u002FFSA-Distillation)\u003Cbr>[从大型语言模型中蒸馏细粒度的情感理解](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18552) \u003Cbr> 张一策、谢光宇、徐洪玲、侯凯恒、鲍建竹、王乾龙、陈世伟、许瑞峰 |\u003Cimg width=\"302\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_67b22951fef1.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FHITSZ-HLT\u002FFSA-Distillation) \u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18552)|[\u002F\u002F]: #12\u002F30\n|[![星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Falonso130r\u002Fknowledge-distillation.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Falonso130r\u002Fknowledge-distillation)\u003Cbr>[利用响应引导提示增强LLM的知识蒸馏](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17846) \u003Cbr> 维杰·戈亚尔、穆斯塔法·汗、阿普拉梅亚·提鲁帕蒂、哈维尔·赛尼、迈克尔·拉姆、凯文·朱 |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_54c57e4de940.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Falonso130r\u002Fknowledge-distillation) \u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17846)|[\u002F\u002F]: #12\u002F30\n|[通过反馈驱动的蒸馏提升小型语言模型的数学推理能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.14698) \u003Cbr> 朱勋宇、李健、马灿、王卫平 |\u003Cimg width=\"1002\" alt=\"image\" 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_d68f45bf243b.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.14698)|[\u002F\u002F]: #12\u002F03\n|[![星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkaistai\u002FGenPI.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fkaistai\u002FGenPI)\u003Cbr>[生成式提示内化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.15927) \u003Cbr> 申河彬、季磊、龚叶云、金成东、崔恩菲、徐敏俊 |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_68b9f4975e3d.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fkaistai\u002FGenPI) \u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.15927)|[\u002F\u002F]: #12\u002F02\n|[SWITCH：与教师一起学习以进行大型语言模型的知识蒸馏](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19503) \u003Cbr> 具在贤、黄艺琳、金永日、姜泰冠、裴贤京、郑教民 |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_480a1ba262b2.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19503)|[\u002F\u002F]: #11\u002F17\n|[![星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjdeschena\u002Fsdtt.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fjdeschena\u002Fsdtt)\u003Cbr>[超越自回归：通过时间自蒸馏实现快速LLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.21035) \u003Cbr> 贾斯汀·德舍诺、卡格拉尔·古尔切赫雷 |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_a415fbf37ac6.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fjdeschena\u002Fsdtt) \u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.21035)|[\u002F\u002F]: #11\u002F17\n|[大型语言模型预训练蒸馏：设计空间探索](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16215) \u003Cbr> 彭浩、吕欣、白宇诗、姚子俊、张佳杰、侯磊、李娟子 | |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16215)|[\u002F\u002F]: 
#10\u002F30\n|[![星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-coai\u002FMiniPLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthu-coai\u002FMiniPLM)\u003Cbr>[MiniPLM：用于预训练语言模型的知识蒸馏](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.17215) \u003Cbr> 顾宇贤、周浩、孟凡东、周杰、黄民烈 |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_03be076a5a1b.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fthu-coai\u002FMiniPLM) \u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.17215)|[\u002F\u002F]: #10\u002F29\n|[推测性知识蒸馏：通过交错采样弥合师生差距](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.11325) \u003Cbr> 徐文达、韩如军、王子峰、黎隆T、马德卡·德鲁夫、李磊、威廉·杨·王、里沙布·阿加瓦尔、李辰宇、托马斯·普菲斯特 |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_93add7a80da2.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.11325)|[\u002F\u002F]: #10\u002F21\n|[用于语言模型对齐的进化对比蒸馏](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.07513) \u003Cbr> 朱利安·卡茨-萨缪尔斯、李正、尹孝根、尼甘·普里扬卡、徐毅、佩特里切克·瓦茨拉夫、殷兵、奇林比·特里舒尔 |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_40e7a306f69f.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.07513)|[\u002F\u002F]: #10\u002F13\n\n### 量化\n| 标题与作者 | 简介 | 链接 |\n|:--|  :----: | :---:|\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIST-DASLab\u002Fgptq.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fgptq)[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-ICLR'22-blue)]()\u003Cbr> :star: [GPTQ: 用于生成式预训练 Transformer 的高精度后训练量化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.17323) \u003Cbr> Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh |\u003Cimg width=\"202\" alt=\"image\" 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_d2d96852491a.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fgptq) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.17323)| [\u002F\u002F]: #Recommend\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmit-han-lab\u002Fsmoothquant.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fsmoothquant)[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-ICML'23-blue)]() \u003Cbr> :star: [SmoothQuant: 大型语言模型的精确高效后训练量化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.10438) \u003Cbr> Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_485158e0b995.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fsmoothquant) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.10438)| [\u002F\u002F]: #Recommend\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmit-han-lab\u002Fllm-awq.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fllm-awq) \u003Cbr> :star: [AWQ: 面向 LLM 压缩与加速的激活感知权重量化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.00978) \u003Cbr> Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Song Han |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_c31399b68c00.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fllm-awq) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.00978)| [\u002F\u002F]: 
#Recommend\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FOmniQuant.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FOmniQuant)[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-ICLR'24-blue)]()\u003Cbr> :star: [OmniQuant: 大型语言模型的全方位校准量化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.13137) \u003Cbr> Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_8f42e8cf1b51.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FOmniQuant) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.13137)| [\u002F\u002F]: #Recommend\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Futkarsh-dmx\u002Fproject-resq.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Futkarsh-dmx\u002Fproject-resq)\u003Cbr>[ResQ: 基于低秩残差的大语言模型混合精度量化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.14363) \u003Cbr> Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, Xin Wang |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_133c5ad7f5b8.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Futkarsh-dmx\u002Fproject-resq) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.14363)|[\u002F\u002F]: #12\u002F30\n|[MixLLM: 输出特征间全局混合精度的 LLM 量化及高效系统设计](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.14590) \u003Cbr> Zhen Zheng, Xiaonan Song, Chuanjie Liu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_406ccb425ac7.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.14590)|[\u002F\u002F]: #12\u002F30\n|[GQSA: 用于加速大型语言模型推理的分组量化与稀疏化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17560) 
\u003Cbr> Chao Zeng, Songwei Liu, Shu Yang, Fangmin Chen, Xing Mei, Lean Fu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_a8d5bc62471b.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17560)|[\u002F\u002F]: #12\u002F30\n|[LSAQ: 针对大型语言模型部署的层特定自适应量化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18135) \u003Cbr> Binrui Zeng, Bin Ji, Xiaodong Liu, Jie Yu, Shasha Li, Jun Ma, Xiaopeng Li, Shangwen Wang, Xinran Hong |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_af005f980304.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18135)|[\u002F\u002F]: #12\u002F30\n|[SKIM: 超越后训练量化的任意比特量化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.04180) \u003Cbr> Runsheng Bai, Qiang Liu, Bo Liu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_54b951e095f8.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.04180)|[\u002F\u002F]: #12\u002F09\n|[CPTQuant -- 大型语言模型的一种新型混合精度后训练量化技术](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.03599) \u003Cbr> Amitash Nanda, Sree Bhargavi Balija, Debashis Sahoo |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_c0ea6a3e486f.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.03599)|[\u002F\u002F]: #12\u002F09\n|[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-HPCA'25-blue)]()\u003Cbr>[Anda: 通过可变长度分组激活数据格式解锁高效的 LLM 推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.15982) \u003Cbr> Chao Fang, Man Shi, Robin Geens, Arne Symons, Zhongfeng Wang, Marian Verhelst |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_331910be1f6e.png\"> 
|[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.15982)|[\u002F\u002F]: #12\u002F03\n|[MixPE: 面向高效 LLM 推理的量化与硬件协同设计](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.16158) \u003Cbr> Yu Zhang, Mingzi Wang, Lancheng Zou, Wulong Liu, Hui-Ling Zhen, Mingxuan Yuan, Bei Yu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_ce747b7ba5c9.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.16158)|[\u002F\u002F]: #12\u002F03\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fabdelfattah-lab\u002FBitMoD-HPCA-25.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fabdelfattah-lab\u002FBitMoD-HPCA-25)[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-HPCA'25-blue)]()\u003Cbr>[BitMoD: 比特串混合数据类型 LLM 加速](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.11745) \u003Cbr> Yuzong Chen, Ahmed F. AbouElhamayed, Xilai Dai, Yang Wang, Marta Andronic, George A. Constantinides, Mohamed S. 
Abdelfattah |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_10b7c5d90bb9.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fabdelfattah-lab\u002FBitMoD-HPCA-25) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.11745)|[\u002F\u002F]: #11\u002F24\n|[AMXFP4: 使用非对称微尺度浮点数驯服激活异常值，实现 4 位 LLM 推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.09909) \u003Cbr> Janghwan Lee, Jiwoong Park, Jinseok Kim, Yongjik Kim, Jungju Oh, Jinwook Oh, Jungwook Choi |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_0668b4d10c45.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.09909)|[\u002F\u002F]: #11\u002F24\n|[Bi-Mamba: 向精确的 1 位状态空间模型迈进](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.11843) \u003Cbr> Shengkun Tang, Liqun Ma, Haonan Li, Mingjie Sun, Zhiqiang Shen |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_0dfc5081f9d2.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.11843)|[\u002F\u002F]: #11\u002F24\n|[\"给我 BF16 还是让我死\"? 
LLM 量化中的准确率-性能权衡](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02355) \u003Cbr> Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh | |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02355)|[\u002F\u002F]: #11\u002F18\n|[GWQ: 面向大型语言模型的梯度感知权重量化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.00850) \u003Cbr> Yihua Shao, Siyu Liang, Xiaolin Lin, Zijian Ling, Zixian Zhu 等 |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_e0ea9d819f2f.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.00850)|[\u002F\u002F]: #11\u002F18\n|[大型语言模型量化技术综合研究](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02530) \u003Cbr> Jiedong Lang, Zhehao Guo, Shuyu Huang | |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02530)|[\u002F\u002F]: #11\u002F18\n|[BitNet a4.8: 用于 1 位 LLM 的 4 位激活](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04965) \u003Cbr> Hongyu Wang, Shuming Ma, Furu Wei |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_b66d18e35824.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04965)|[\u002F\u002F]: #11\u002F18\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIntelligent-Computing-Lab-Yale\u002FTesseraQ.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FIntelligent-Computing-Lab-Yale\u002FTesseraQ)\u003Cbr>[TesseraQ: 基于块重建的超低比特 LLM 后训练量化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19103) \u003Cbr> Yuhang Li, Priyadarshini Panda |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_05495a561b8a.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FIntelligent-Computing-Lab-Yale\u002FTesseraQ) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19103)|[\u002F\u002F]: 
#11\u002F17\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fxinghaow99\u002FBitStack.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fxinghaow99\u002FBitStack)\u003Cbr>[BitStack: 在可变内存环境下压缩大型语言模型的细粒度尺寸控制](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.23918) \u003Cbr> Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fxinghaow99\u002FBitStack\u002Fraw\u002Fmain\u002Fassets\u002Fbitstack.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fxinghaow99\u002FBitStack) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.23918)|[\u002F\u002F]: #11\u002F17\n|[推理加速策略对 LLM 偏见的影响](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.22118) \u003Cbr> Elisabeth Kirsten, Ivan Habernal, Vedant Nanda, Muhammad Bilal Zafar | |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.22118)|[\u002F\u002F]: #11\u002F17\n|[理解大型语言模型低精度后训练量化的难度](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.14570) \u003Cbr> Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan Webb, Xin Wang |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_0863bb6cf799.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.14570)|[\u002F\u002F]: #10\u002F30\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FBitNet.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FBitNet)\u003Cbr>[1 位 AI 基础设施: 第 1.1 部分，CPU 上快速无损的 BitNet b1.58 推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16144) \u003Cbr> Jinheng Wang, Hansong Zhou, Ting Song, Shaoguang Mao, Shuming Ma, Hongyu Wang, Yan Xia, Furu Wei |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_e24eda1bd4b2.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FBitNet) \u003Cbr> 
[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16144)|[\u002F\u002F]: #10\u002F30\n|[QuAILoRA: 面向 LoRA 的量化感知初始化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.14713) \u003Cbr> Neal Lawton, Aishwarya Padmakumar, Judith Gaspers, Jack FitzGerald, Anoop Kumar, Greg Ver Steeg, Aram Galstyan | |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.14713)|[\u002F\u002F]: #10\u002F30\n|[在低资源语言基准上评估量化大型语言模型的代码生成能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.14766) \u003Cbr> Enkhbold Nyamsuren | |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.14766)|[\u002F\u002F]: #10\u002F30\n| [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSqueezeAILab\u002FSqueezeLLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FSqueezeAILab\u002FSqueezeLLM) \u003Cbr> :star: [SqueezeLLM: 密集与稀疏量化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.07629.pdf) \u003Cbr>Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer | \u003Cimg width=\"1102\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_52458f45dcf9.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FSqueezeAILab\u002FSqueezeLLM) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.07629.pdf)| [\u002F\u002F]: #Recommend\n|[金字塔向量量化应用于 LLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16926) \u003Cbr> Tycho F. A. van der Ouderaa, Maximilian L. 
Croci, Agrin Hilmkil, James Hensman |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_03dd993c24b7.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16926)|[\u002F\u002F]: #10\u002F29\n|[SeedLM: 将 LLM 权重压缩为伪随机数生成器种子](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10714) \u003Cbr> Rasoul Shafipour, David Harrison, Maxwell Horton, Jeffrey Marker, Houman Bedayat, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, Saman Naderiparizi |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_fddc01e959c3.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10714)|[\u002F\u002F]: #10\u002F21\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fruikangliu\u002FFlatQuant.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fruikangliu\u002FFlatQuant)\u003Cbr>[FlatQuant: 平坦性对 LLM 量化至关重要](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.09426) \u003Cbr> Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, Jun Yao |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_49b9ed3b0545.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fruikangliu\u002FFlatQuant) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.09426)|[\u002F\u002F]: #10\u002F21\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMohammad-Mozaffari\u002Fslim.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FMohammad-Mozaffari\u002Fslim)\u003Cbr>[SLiM: LLM 的一次性量化稀疏加低秩近似](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.09615) \u003Cbr> Mohammad Mozaffari, Maryam Mehri Dehnavi |\u003Cimg width=\"1002\" alt=\"image\" 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_26f885d1beb0.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FMohammad-Mozaffari\u002Fslim) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.09615)|[\u002F\u002F]: #10\u002F21\n|[后训练量化大型语言模型的规模法则](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12119) \u003Cbr> Zifei Xu, Alexander Lan, Wanzin Yazar, Tristan Webb, Sayeh Sharify, Xin Wang |\u003Cimg width=\"202\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_64e03596e2e6.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12119)|[\u002F\u002F]: #10\u002F21\n|[连续近似用于改进 LLM 的量化感知训练](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10849) \u003Cbr> He Li, Jianhang Hong, Yuanzhuo Wu, Snehal Adbol, Zonglin Li | |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10849)|[\u002F\u002F]: #10\u002F21\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLuoYingSong\u002FDAQ.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FLuoYingSong\u002FDAQ)\u003Cbr>[DAQ: 面向 LLM 的密度感知仅权重后训练量化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12187) \u003Cbr> Yingsong Luo, Ling Chen |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_8eb5f9897357.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FLuoYingSong\u002FDAQ) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12187)|[\u002F\u002F]: #10\u002F21\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fenyac-group\u002FQuamba.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fenyac-group\u002FQuamba)\u003Cbr>[Quamba: 一种针对选择性状态空间模型的后训练量化配方](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13229) \u003Cbr> Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin、Kai-Chiang Wu、Diana Marculescu |\u003Cimg width=\"1002\" alt=\"image\" 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_137619497ba8.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fenyac-group\u002FQuamba) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13229)|[\u002F\u002F]: #10\u002F21\n|[AsymKV: 通过逐层非对称量化配置实现 KV 缓存的 1 位量化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13212) \u003Cbr> Qian Tao, Wenyuan Yu, Jingren Zhou |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_9df2fb11f2dc.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13212)|[\u002F\u002F]: #10\u002F21\n|[面向大型语言模型的通道级混合精度量化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13056) \u003Cbr> Zihan Chen, Bike Xie, Jundong Li, Cong Shen |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_48297dd4109e.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13056)|[\u002F\u002F]: #10\u002F21\n|[渐进式混合精度解码用于高效 LLM 推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13461) \u003Cbr> Hao Mark Chen, Fuwen Tan, Alexandros Kouris、Royson Lee、Hongxiang Fan、Stylianos I. 
Venieris |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_d7e7a905f040.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13461)|[\u002F\u002F]: #10\u002F21\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAnonymous1252022\u002FEXAQ.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAnonymous1252022\u002FEXAQ)\u003Cbr>[EXAQ: 指数感知量化用于加速 LLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.03185) \u003Cbr> Moran Shkolnik, Maxim Fishman、Brian Chmiel、Hilla Ben-Yaacov、Ron Banner、Kfir Yehuda Levy |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_902f8f4eb704.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FAnonymous1252022\u002FEXAQ) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.03185)|[\u002F\u002F]: #10\u002F14\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChenMnZ\u002FPrefixQuant.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FChenMnZ\u002FPrefixQuant)\u003Cbr>[PrefixQuant: 静态量化胜过动态量化，通过 LLM 中的前缀异常值实现](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05265) \u003Cbr> Mengzhao Chen、Yi Liu、Jiahao Wang、Yi Bin、Wenqi Shao、Ping Luo |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_0d1a60e04102.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FChenMnZ\u002FPrefixQuant) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05265)|[\u002F\u002F]: #10\u002F14\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvahe1994\u002FAQLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fvahe1994\u002FAQLM)\u003Cbr> :star: [通过加法量化实现大型语言模型的极致压缩](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.06118) \u003Cbr> Vage Egiazarian、Andrei Panferov、Denis Kuznedelev、Elias Frantar、Artem 
Babenko、Dan Alistarh |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_f89e976728fd.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fvahe1994\u002FAQLM) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.06118)| [\u002F\u002F]: #Recommend\n|[大型语言模型中混合量化的规模法则](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.06722) \u003Cbr> Zeyu Cao、Cheng Zhang、Pedro Gimenes、Jianqiao Lu、Jianyi Cheng、Yiren Zhao |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_2c88202f3ad1.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.06722)|[\u002F\u002F]: #10\u002F14\n|[PalmBench: 移动平台上压缩大型语言模型的综合基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05315) \u003Cbr> Yilong Li、Jingyu Liu、Hao Zhang、M Badri Narayanan、Utkarsh Sharma、Shuai Zhang、Pan Hu、Yijing Zeng、Jayaram Raghuram、Suman Banerjee |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_e535ea49bac1.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05315)|[\u002F\u002F]: #10\u002F14\n|[CrossQuant: 一种具有更小量化核的后训练量化方法，用于精确压缩大型语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.07505) \u003Cbr> Wenyuan Liu、Xindian Ma、Peng Zhang、Yan Wang |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_523afd542b61.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.07505)|[\u002F\u002F]: #10\u002F13\n|[SageAttention: 准确的 8 位注意力机制，用于即插即用的推理加速](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.02367) \u003Cbr> Jintao Zhang、Jia wei、Pengle Zhang、Jun Zhu、Jianfei Chen |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_99cbb9f69298.png\"> 
|[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.02367)|[\u002F\u002F]: #10\u002F04\n|[对于节能语言模型，只需加法就够了](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.00907) \u003Cbr> Hongyin Luo、Wei Sun |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_7bddbfc7dce8.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.00907)|[\u002F\u002F]: #10\u002F02\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsnu-mllab\u002FGuidedQuant.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fsnu-mllab\u002FGuidedQuant)[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-ICML'25-blue)]()\u003Cbr>[GuidedQuant: 利用末端损失指导进行大型语言模型量化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07004) \u003Cbr> Jinuk Kim、Marwa El Halabi、Wonpyo Park、Clemens JS Schaefer、Deokjae Lee、Yeonhong Park、Jae W. Lee、Hyun Oh Song |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_38c086491872.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fsnu-mllab\u002FGuidedQuant) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07004)|[\u002F\u002F]: #06\u002F15\n\n\n\n### 推理加速\n| 标题与作者 | 简介 | 链接 |\n|:--|  :----: | :---:|\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFMInference\u002FDejaVu.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FFMInference\u002FDejaVu)[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-ICML'23%20Oral-blue)]()\u003Cbr> :star: [Deja Vu: 上下文稀疏性用于推理时高效的LLM](https:\u002F\u002Fopenreview.net\u002Fforum?id=wIPIhHd00i) \u003Cbr> 刘子畅, 王珏, 特里·道, 周天一, 袁彬航, 宋昭, 安舒马利·施里瓦斯塔瓦, 张策, 田元东, 克里斯托弗·雷, 陈贝迪 |\u003Cimg width=\"202\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_f36812be4d63.png\"> 
|[Github](https://github.com/FMInference/DejaVu) <br> [Paper](https://openreview.net/forum?id=wIPIhHd00i)| [//]: #Recommend
| [![Star](https://img.shields.io/github/stars/flexflow/FlexFlow.svg?style=social&label=Star)](https://github.com/flexflow/FlexFlow/tree/inference) <br> :star: [SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification](https://arxiv.org/abs/2305.09781) <br> Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia| <img width="600" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_4e13a059dddc.png">| [Github](https://github.com/flexflow/FlexFlow/tree/inference) <br> [Paper](https://arxiv.org/abs/2305.09781) | [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/mit-han-lab/streaming-llm.svg?style=social&label=Star)](https://github.com/mit-han-lab/streaming-llm)<br> :star: [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453) <br> Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_c7620847d34b.png"> |[Github](https://github.com/mit-han-lab/streaming-llm) <br> [Paper](https://arxiv.org/abs/2309.17453)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/SafeAILab/EAGLE.svg?style=social&label=Star)](https://github.com/SafeAILab/EAGLE)<br>:star: [EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation](https://sites.google.com/view/eagle-llm) <br> Yuhui Li, Chao Zhang, Hongyang Zhang |<img width="302" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_64a3f8a5a382.png"> |[Github](https://github.com/SafeAILab/EAGLE) <br> [Blog](https://sites.google.com/view/eagle-llm)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/FasterDecoding/Medusa.svg?style=social&label=Star)](https://github.com/FasterDecoding/Medusa)<br> :star: [Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads](https://arxiv.org/abs/2401.10774) <br> Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_7477eb4c9b19.png"> |[Github](https://github.com/FasterDecoding/Medusa) <br> [Paper](https://arxiv.org/abs/2401.10774)| [//]: #Recommend
|[Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration](https://arxiv.org/abs/2412.00061) <br> Zhuofan Wen, Shangtong Gui, Yang Feng |<img width="302" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_8db7063714b2.png"> |[Paper](https://arxiv.org/abs/2412.00061)|[//]: #12/09
|[PLD+: Accelerating LLM Inference by Leveraging Language Model Artifacts](https://arxiv.org/abs/2412.01447) <br> Shwetha Somasundaram, Anirudh Phukan, Apoorv Saxena |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_4a0c386a8e76.png"> |[Paper](https://arxiv.org/abs/2412.01447)|[//]: #12/09
|[![Publish](https://img.shields.io/badge/Conference-NeurIPS'24%20ENLSP-blue)]()<br>[FastDraft: How to Train Your Draft](https://arxiv.org/abs/2411.11055) <br> Ofir Zafrir, Igor Margulis, Dorin Shteyman, Guy Boudoukh | |[Paper](https://arxiv.org/abs/2411.11055)|[//]: #11/24
|[![Star](https://img.shields.io/github/stars/David-Li0406/SMoA.svg?style=social&label=Star)](https://github.com/David-Li0406/SMoA)<br>[SMoA: Improving Multi-agent Large Language Models with Sparse Mixture-of-Agents](https://arxiv.org/abs/2411.03284) <br> Dawei Li, Zhen Tan, Peijia Qian, Yifan Li, Kumar Satvik Chaudhary, Lijie Hu, Jiayi Shen |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_4f5e7f4a5f32.png"> |[Github](https://github.com/David-Li0406/SMoA) <br> [Paper](https://arxiv.org/abs/2411.03284)|[//]: #11/18
|[The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation](https://arxiv.org/abs/2411.03786) <br> Lawrence Stewart, Matthew Trager, Sujan Kumar Gonugondla, Stefano Soatto | |[Paper](https://arxiv.org/abs/2411.03786)|[//]: #11/18
|[Accelerating AI Inference via Dynamic Execution Methods](https://arxiv.org/abs/2411.00853) <br> Haim Barad, Jascha Achterberg, Tianpei Zhou, Zhen Yu | |[Paper](https://arxiv.org/abs/2411.00853)|[//]: #11/18
|[SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference](https://arxiv.org/abs/2411.04975) <br> Gabriele Oliaro, Zhihao Jia, Daniel Campos, Aurick Qiao |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_4d71033189b7.png"> |[Paper](https://arxiv.org/abs/2411.04975)|[//]: #11/18
|[Dynamic Strategy Planning for Efficient Question Answering with Large Language Models](https://arxiv.org/abs/2410.23511) <br> Tanmay Parekh, Pradyot Prakash, Alexander Radovic, Akshay Shekher, Denis Savenkov |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_3037677729c9.png"> |[Paper](https://arxiv.org/abs/2410.23511)|[//]: #11/17
|[![Star](https://img.shields.io/github/stars/Infini-AI-Lab/MagicPIG.svg?style=social&label=Star)](https://github.com/Infini-AI-Lab/MagicPIG)<br>[MagicPIG: LSH Sampling for Efficient LLM Generation](https://arxiv.org/abs/2410.16179) <br> Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Léon Bottou, Zhihao Jia, Beidi Chen |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_aefee4162d09.png"> |[Github](https://github.com/Infini-AI-Lab/MagicPIG) <br> [Paper](https://arxiv.org/abs/2410.16179)|[//]: #10/30
|[Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition](https://arxiv.org/abs/2410.17765) <br> Artem Basharin, Andrei Chertkov, Ivan Oseledets |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_5613c1b4d139.png"> |[Paper](https://arxiv.org/abs/2410.17765)|[//]: #10/29
|[Efficient Inference for Augmented Large Language Models](https://arxiv.org/abs/2410.18248) <br> Rana Shahout, Cong Liang, Shiji Xin, Qianru Lao, Yong Cui, Minlan Yu, Michael Mitzenmacher |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_d0954155d282.png"> |[Paper](https://arxiv.org/abs/2410.18248)|[//]: #10/29
|[![Star](https://img.shields.io/github/stars/MatteoNulli/Vocabulary_pruning.svg?style=social&label=Star)](https://github.com/MatteoNulli/Vocabulary_pruning)<br>[Dynamic Vocabulary Pruning in Early-Exit LLMs](https://arxiv.org/abs/2410.18952) <br> Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec |<img width="1002" alt="image" src="https://github.com/MatteoNulli/Vocabulary_pruning/raw/main/src/images/final_nips.svg">
|[Github](https://github.com/MatteoNulli/Vocabulary_pruning) <br> [Paper](https://arxiv.org/abs/2410.18952)|[//]: #10/29
|[![Star](https://img.shields.io/github/stars/wangqinsi1/CoreInfer.svg?style=social&label=Star)](https://github.com/wangqinsi1/CoreInfer)<br>[CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation](https://arxiv.org/abs/2410.18311#) <br> Qinsi Wang, Saeed Vahidian, Hancheng Ye, Jianyang Gu, Jianyi Zhang, Yiran Chen |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_19d89833b500.png"> |[Github](https://github.com/wangqinsi1/CoreInfer) <br> [Paper](https://arxiv.org/abs/2410.18311#)|[//]: #10/29
|[![Star](https://img.shields.io/github/stars/mit-han-lab/duo-attention.svg?style=social&label=Star)](https://github.com/mit-han-lab/duo-attention)<br>[DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads](https://arxiv.org/abs/2410.10819) <br> Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_2c701b8ab49f.jpg"> |[Github](https://github.com/mit-han-lab/duo-attention) <br> [Paper](https://arxiv.org/abs/2410.10819)|[//]: #10/21
|[DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure](https://arxiv.org/abs/2410.11744) <br> Yunfan Xiong, Ruoyu Zhang, Yanzeng Li, Tianhao Wu, Lei Zou |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_05d786052eaf.png"> |[Paper](https://arxiv.org/abs/2410.11744)|[//]: #10/21
|[QSpec: Speculative Decoding with Complementary Quantization Schemes](https://arxiv.org/abs/2410.11305) <br> Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, Chuan Wu |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_1da06f958d43.png"> |[Paper](https://arxiv.org/abs/2410.11305)|[//]: #10/21
|[TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention](https://arxiv.org/abs/2410.05076) <br> Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_b82cd2405e73.png"> |[Paper](https://arxiv.org/abs/2410.05076)|[//]: #10/14
|[ParallelSpec: Parallel Drafter for Efficient Speculative Decoding](https://arxiv.org/abs/2410.05589) <br> Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_64b28cb53560.png"> |[Paper](https://arxiv.org/abs/2410.05589)|[//]: #10/14
|[![Star](https://img.shields.io/github/stars/hemingkx/SWIFT.svg?style=social&label=Star)](https://github.com/hemingkx/SWIFT)<br>[SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration](https://arxiv.org/abs/2410.06916) <br> Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_d2b7a7c3e0dd.png"> |[Github](https://github.com/hemingkx/SWIFT) <br> [Paper](https://arxiv.org/abs/2410.06916)|[//]: #10/14
|[![Star](https://img.shields.io/github/stars/MooreThreads/TurboRAG.svg?style=social&label=Star)](https://github.com/MooreThreads/TurboRAG)<br>[TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text](https://arxiv.org/abs/2410.07590) <br> Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, Yaohua Tang |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_d41beb613f61.png"> |[Github](https://github.com/MooreThreads/TurboRAG) <br> [Paper](https://arxiv.org/abs/2410.07590)|[//]: #10/13
|[A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts](https://arxiv.org/abs/2410.01485) <br> Suyu Ge, Xihui Lin, Yunan Zhang, Jiawei Han, Hao Peng |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_13d8bec83858.png"> |[Paper](https://arxiv.org/abs/2410.01485)|[//]: #10/04
|[![Publish](https://img.shields.io/badge/Conference-SIGMOD'25-blue)]()<br>[Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation](https://www.arxiv.org/pdf/2502.15734) <br> Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Mahapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, Shiv Saini |<img width="600" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_a7eaeaf1025d.png"> | [Paper](https://www.arxiv.org/pdf/2502.15734)|[//]: #02/05
|[Mamba Drafters for Speculative Decoding](https://arxiv.org/abs/2506.01206) <br> Daewon Choi, Seunghyuk Oh, Saket Dingliwal, Jihoon Tack, Kyuyoung Kim, Woomin Song, Seojin Kim, Insu Han, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_e49500014f3e.png"> |[Paper](https://arxiv.org/abs/2506.01206)|[//]: #06/03
|[Accelerating Test-Time Scaling with Model-Free Speculative Sampling](https://arxiv.org/abs/2506.04708) <br> Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_dde21153ee5c.png">
|[Paper](https://arxiv.org/abs/2506.04708)|[//]: #06/05

### Efficient Mixture-of-Experts
| Title & Authors | Introduction | Links |
|:--|  :----: | :---:|
|[![Star](https://img.shields.io/github/stars/dvmazur/mixtral-offloading.svg?style=social&label=Star)](https://github.com/dvmazur/mixtral-offloading)<br>:star: [Fast Inference of Mixture-of-Experts Language Models with Offloading](https://arxiv.org/abs/2312.17238) <br> Artyom Eliseev, Denis Mazur |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_5bab035f2588.png"> |[Github](https://github.com/dvmazur/mixtral-offloading) <br> [Paper](https://arxiv.org/abs/2312.17238)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/duterscmy/CD-MoE.svg?style=social&label=Star)](https://github.com/duterscmy/CD-MoE)<br>[Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning](https://arxiv.org/abs/2412.00069) <br> Mingyu Cao, Gen Li, Jie Ji, Jiaqi Zhang, Xiaolong Ma, Shiwei Liu, Lu Yin |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_1295fc6b67f1.png"> |[Github](https://github.com/duterscmy/CD-MoE) <br> [Paper](https://arxiv.org/abs/2412.00069)|[//]: #12/09
|[Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference](https://arxiv.org/abs/2412.00099) <br> Andrii Skliar, Ties van Rozendaal, Romain Lepert, Todor Boinovski, Mart van Baalen, Markus Nagel, Paul Whatmough, Babak Ehteshami Bejnordi |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_d4fb821981e8.png"> |[Paper](https://arxiv.org/abs/2412.00099)|[//]: #12/09
|[![Star](https://img.shields.io/github/stars/EnflameTechnology/DeepSpeed.svg?style=social&label=Star)](https://github.com/EnflameTechnology/DeepSpeed)<br>[MoNTA: Accelerating Mixture-of-Experts Training with Network-Traffic-Aware Parallel Optimization](https://arxiv.org/abs/2411.00662) <br> Jingming Guo, Yan Liu, Yu Meng, Zhiwei Tao, Banglan Liu, Gang Chen, Xiang Li |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_7d696b0f7b2c.png"> |[Github](https://github.com/EnflameTechnology/DeepSpeed) <br> [Paper](https://arxiv.org/abs/2411.00662)|[//]: #11/18
|[![Star](https://img.shields.io/github/stars/xiaochengsky/MoEI-2.svg?style=social&label=Star)](https://github.com/xiaochengsky/MoEI-2)<br>[MoE-I2: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition](https://arxiv.org/abs/2411.01016) <br> Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Yuanlin Duan, Wenqi Jia, Miao Yin, Yu Cheng, Bo Yuan |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_7e36972d2be6.png"> |[Github](https://github.com/xiaochengsky/MoEI-2) <br> [Paper](https://arxiv.org/abs/2411.01016)|[//]: #11/18
|[HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference](https://arxiv.org/abs/2411.01433) <br> Peng Tang, Jiacheng Liu, Xiaofeng Hou, Yifei Pu, Jing Wang, Pheng-Ann Heng, Chao Li, Minyi Guo |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_51a7a192a919.png"> |[Paper](https://arxiv.org/abs/2411.01433)|[//]: #11/18
|[ProMoE: Fast MoE-based LLM Serving using Proactive Caching](https://arxiv.org/abs/2410.22134) <br> Xiaoniu Song, Zihang Zhong, Rong Chen |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_b2fabf32e265.png">
|[Paper](https://arxiv.org/abs/2410.22134)|[//]: #11/17
|[ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference](https://arxiv.org/abs/2410.17954) <br> Xin He, Shunkang Zhang, Yuxin Wang, Haiyan Yin, Zihao Zeng, Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Ivor Tsang, Yew-Soon Ong |<img width="202" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_b34d16633a90.png"> |[Paper](https://arxiv.org/abs/2410.17954)|[//]: #10/29
|[EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference](https://arxiv.org/abs/2410.12247) <br> Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai | |[Paper](https://arxiv.org/abs/2410.12247)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/Aaronhuang-778/MC-MoE.svg?style=social&label=Star)](https://github.com/Aaronhuang-778/MC-MoE)<br>[MC-MoE: Mixed Compressor for Mixture-of-Experts LLMs Gains More](https://arxiv.org/abs/2410.06270) <br> Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, Xiaojuan Qi |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_3065b47eae2e.png"> |[Github](https://github.com/Aaronhuang-778/MC-MoE) <br> [Paper](https://arxiv.org/abs/2410.06270)|[//]: #10/14

### Efficient LLM Architectures
| Title & Authors | Introduction | Links |
|:--|  :----: | :---:|
|[![Star](https://img.shields.io/github/stars/NVlabs/hymba.svg?style=social&label=Star)](https://github.com/NVlabs/hymba) ![Publish](https://img.shields.io/badge/Conference-ICLR'25-blue) <br>[Hymba: A Hybrid-head Architecture for Small Language Models](https://www.arxiv.org/abs/2411.13676) <br> Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_0e09ecb6d280.png"> |[Paper](https://www.arxiv.org/pdf/2411.13676)|
|[![Star](https://img.shields.io/github/stars/mbzuai-oryx/MobiLlama.svg?style=social&label=Star)](https://github.com/mbzuai-oryx/MobiLlama)<br>:star: [MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT](https://arxiv.org/abs/2402.16840) <br> Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan |<img width="402" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_5097e6423155.gif"> |[Github](https://github.com/mbzuai-oryx/MobiLlama) <br> [Paper](https://arxiv.org/abs/2402.16840) <br>[Model](https://huggingface.co/MBZUAI/MobiLlama-05B) | [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/XuezheMax/megalodon.svg?style=social&label=Star)](https://github.com/XuezheMax/megalodon)<br>:star: [Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length](https://arxiv.org/abs/2404.08801) <br> Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_3e6d9e588670.png"> |[Github](https://github.com/XuezheMax/megalodon) <br> [Paper](https://arxiv.org/abs/2404.08801)| [//]: #Recommend
|[Taipan: Efficient and Expressive State Space Language Models with Selective Attention](https://arxiv.org/abs/2410.18572) <br> Chien Van Nguyen, Huy Huu Nguyen, Thang M. Pham, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Ryan A. Rossi, Trung Bui, Viet Dac Lai, Franck Dernoncourt, Thien Huu Nguyen |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_286eb3c2168c.png"> |[Paper](https://arxiv.org/abs/2410.18572)|[//]: #10/29
|[![Star](https://img.shields.io/github/stars/microsoft/SeerAttention.svg?style=social&label=Star)](https://github.com/microsoft/SeerAttention)<br>[SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs](https://arxiv.org/abs/2410.13276) <br> Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Hayden Kwok-Hay So, Ting Cao, Fan Yang, Mao Yang |<img width="202" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_a5fffc1aae14.png"> |[Github](https://github.com/microsoft/SeerAttention) <br> [Paper](https://arxiv.org/abs/2410.13276)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/TUDa-HWAI/Basis_Sharing.svg?style=social&label=Star)](https://github.com/TUDa-HWAI/Basis_Sharing)<br>[Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression](https://arxiv.org/abs/2410.03765) <br> Jingcun Wang, Yu-Guang Chen, Ing-Chao Lin, Bing Li, Grace Li Zhang |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_53778025f2c6.png"> |[Github](https://github.com/TUDa-HWAI/Basis_Sharing) <br> [Paper](https://arxiv.org/abs/2410.03765)|[//]: #10/14
|[Rodimus*: Breaking the Accuracy-Efficiency Trade-off with Efficient Attentions](https://arxiv.org/abs/2410.06577) <br> Zhihao He, Hang Yu, Zi Gong, Shizhan Liu, Jianguo Li, Weiyao Lin |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_f08feffc0375.png"> |[Paper](https://arxiv.org/abs/2410.06577)|[//]: #10/14
|[Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers](https://arxiv.org/abs/2506.01215) <br> Woomin Song, Sai Muralidhar Jayanthi, Srikanth Ronanki, Kanthashree Mysore Sathyendra, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_5f148dd3d747.png"> |[Paper](https://arxiv.org/abs/2506.01215)|[//]: #06/03

### KV Cache Compression
| Title & Authors | Introduction | Links |
|:--|  :----: | :---:|
|:star: [Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs](https://arxiv.org/abs/2310.01801) <br> Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_33749ef570b1.png"> |[Paper](https://arxiv.org/abs/2310.01801)|[//]: #Recommend
|[ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression](https://arxiv.org/abs/2412.03213) <br> Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, Minyi Guo |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_db78ee1df40b.png"> |[Paper](https://arxiv.org/abs/2412.03213)|[//]: #12/09
|[Unifying KV Cache Compression for Large Language Models with LeanKV](https://arxiv.org/abs/2412.03131) <br> Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_5da6131bb0ec.png"> |[Paper](https://arxiv.org/abs/2412.03131)|[//]: #12/09
|[Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity](https://arxiv.org/abs/2412.02252) <br> Da Ma, Lu Chen, Situo Zhang, Yuxun Miao, Su Zhu, Zhi Chen, Hongshen Xu, Hanqi Li, Shuai Fan, Lei Pan, Kai Yu |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_7aa5d746563e.png"> |[Paper](https://arxiv.org/abs/2412.02252)|[//]: #12/09
|[MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache](https://arxiv.org/abs/2411.18077) <br> Akshat Sharma, Hangliang Ding, Jianping Li, Neel Dani, Minjia Zhang |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_821b2fa29a81.png"> |[Paper](https://arxiv.org/abs/2411.18077)|[//]: #12/07
|[TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection](https://arxiv.org/abs/2411.02886) <br> Wei Wu, Zhuoshi Pan, Chao Wang, Liyi Chen, Yunchu Bai, Kun Fu, Zheng Wang, Hui Xiong |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_c12bc0cf294e.png"> |[Paper](https://arxiv.org/abs/2411.02886)|[//]: #11/18
|[![Star](https://img.shields.io/github/stars/FYYFU/HeadKV.svg?style=social&label=Star)](https://github.com/FYYFU/HeadKV)<br>[Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning](https://arxiv.org/abs/2410.19258) <br> Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, Wen Xiao |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_ebaece5485e2.png"> |[Github](https://github.com/FYYFU/HeadKV) <br> [Paper](https://arxiv.org/abs/2410.19258)|[//]: #11/17
|[![Star](https://img.shields.io/github/stars/JunqiZhao888/buzz-llm.svg?style=social&label=Star)](https://github.com/JunqiZhao888/buzz-llm)<br>[BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference](https://arxiv.org/abs/2410.23079) <br> Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_b61c659616b5.png"> |[Github](https://github.com/JunqiZhao888/buzz-llm) <br> [Paper](https://arxiv.org/abs/2410.23079)|[//]: #11/17
|[![Star](https://img.shields.io/github/stars/whyNLP/LCKV.svg?style=social&label=Star)](https://github.com/whyNLP/LCKV)<br>[A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference](https://arxiv.org/abs/2410.14442) <br> You Wu, Haoyi Wu, Kewei Tu |<img width="202" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_ed7355f6f1e9.png"> |[Github](https://github.com/whyNLP/LCKV) <br> [Paper](https://arxiv.org/abs/2410.14442)|[//]: #10/30
|[Lossless KV Cache Compression to 2%](https://arxiv.org/abs/2410.15252) <br> Zhen Yang, J.N.Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_a57c596932fd.png"> |[Paper](https://arxiv.org/abs/2410.15252)|[//]: #10/30
|[MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection](https://arxiv.org/abs/2410.14731) <br> Bokai Lin, Zihao Zeng, Zipeng Xiao, Siqi Kou, Tianqi Hou, Xiaofeng Gao, Hao Zhang, Zhijie Deng |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_89553282783b.png"> |[Paper](https://arxiv.org/abs/2410.14731)|[//]: #10/30
|[![Star](https://img.shields.io/github/stars/iankur/vqllm.svg?style=social&label=Star)](https://github.com/iankur/vqllm)<br>[Residual Vector Quantization for KV Cache Compression in Large Language Models](https://arxiv.org/abs/2410.15704) <br> Ankur Kumar | |[Github](https://github.com/iankur/vqllm) <br> [Paper](https://arxiv.org/abs/2410.15704)|[//]: #10/30
|[![Star](https://img.shields.io/github/stars/yangyifei729/KVSharer.svg?style=social&label=Star)](https://github.com/yangyifei729/KVSharer)<br>[KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing](https://arxiv.org/abs/2410.18517) <br> Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_49ab068a8b18.jpg"> |[Github](https://github.com/yangyifei729/KVSharer) <br> [Paper](https://arxiv.org/abs/2410.18517)|[//]: #10/29
|[LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy](https://arxiv.org/abs/2410.03111) <br> Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_10a0053e5c81.png">
|[Paper](https://arxiv.org/abs/2410.03111)|[//]: #10/14
|[SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation](https://arxiv.org/abs/2410.03960) <br> Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, Yuxiong He |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_9bccefa9ed53.png"> |[Paper](https://arxiv.org/abs/2410.03960)|[//]: #10/14
|[![Publish](https://img.shields.io/badge/Conference-ICML'24-blue)]()<br>[Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference](https://arxiv.org/abs/2403.09636) <br> Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_25361d534e4d.png"> |[Paper](https://arxiv.org/abs/2403.09636)|[//]: #10/02
|[KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head](https://arxiv.org/abs/2410.00161) <br> Isaac Rehg |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_c27d348a7f0e.png"> |[Paper](https://arxiv.org/abs/2410.00161)|[//]: #10/02
|[![Star](https://img.shields.io/github/stars/FFY0/AdaKV.svg?style=social&label=Star)](https://github.com/FFY0/AdaKV)<br>[Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference](https://arxiv.org/abs/2407.11550) <br> Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_5944a12e847b.png"> |[Github](https://github.com/FFY0/AdaKV) <br> [Paper](https://arxiv.org/abs/2407.11550)|[//]: #10/13
|[![Star](https://img.shields.io/github/stars/snu-mllab/Context-Memory.svg?style=social&label=Star)](https://github.com/snu-mllab/Context-Memory) [![Publish](https://img.shields.io/badge/Conference-ICLR'24-blue)]() <br>[Compressed Context Memory for Online Language Model Interaction](https://arxiv.org/abs/2312.03414) <br> Jang-Hyun Kim, Junyoung Yeom, Sangdoo Yun, Hyun Oh Song |<img width="902" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_e2bd86a5a5ce.png"> |[Github](https://github.com/snu-mllab/Context-Memory) <br> [Paper](https://arxiv.org/abs/2312.03414)|[//]: #10/13
|[![Star](https://img.shields.io/github/stars/snu-mllab/KVzip.svg?style=social&label=Star)](https://github.com/snu-mllab/KVzip)<br>[KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction](https://arxiv.org/abs/2505.23416) <br> Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_a4683bbb6335.png"> |[Github](https://github.com/snu-mllab/KVzip) [Paper](https://arxiv.org/abs/2505.23416)|[//]: #05/29

### Text Compression
| Title & Authors | Introduction | Links |
|:--|  :----: | :---:|
|[![Star](https://img.shields.io/github/stars/microsoft/LLMLingua.svg?style=social&label=Star)](https://github.com/microsoft/LLMLingua)[![Publish](https://img.shields.io/badge/Conference-EMNLP'23-blue)]()<br>:star: [LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models](https://arxiv.org/abs/2310.05736) <br> Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_416e1066617b.png"> |[Github](https://github.com/microsoft/LLMLingua) <br> [Paper](https://arxiv.org/abs/2310.05736)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/alipay/L3TC-leveraging-rwkv-for-learned-lossless-low-complexity-text-compression.svg?style=social&label=Star)](https://github.com/alipay/L3TC-leveraging-rwkv-for-learned-lossless-low-complexity-text-compression)<br>[L3TC: Leveraging RWKV for Learned Lossless Low-Complexity Text Compression](https://arxiv.org/abs/2412.16642) <br> Junxuan Zhang, Zhengxue Cheng, Yan Zhao, Shihao Wang, Dajiang Zhou, Guo Lu, Li Song |<img width="1002" alt="image" src="https://oss.gittoolsai.com/images/horseee_Awesome-Efficient-LLM_readme_edb6bd449a62.png"> |[Github](https://github.com/alipay/L3TC-leveraging-rwkv-for-learned-lossless-low-complexity-text-compression) <br> [Paper](https://arxiv.org/abs/2412.16642)|[//]: 
#12\u002F30\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNL2G\u002Fpromptoptme.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FNL2G\u002Fpromptoptme)\u003Cbr>[PromptOptMe：基于LLM的MT评估指标中的误差感知提示压缩](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.16120) \u003Cbr> 丹尼尔·拉里奥诺夫、施特芬·埃格尔 |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_a4176537b8bd.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FNL2G\u002Fpromptoptme) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.16120)|[\u002F\u002F]: #12\u002F30\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FLLMLingua.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLMLingua)\u003Cbr>:star: [LongLLMLingua：通过提示压缩加速并增强长上下文场景下的LLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06839) \u003Cbr> 江辉强、吴千慧、罗旭芳、李东升、林振宇、杨宇青、邱丽丽 |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_75bde773ef3c.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLMLingua) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06839)| [\u002F\u002F]: #推荐\n|[关于全注意力机制的银弹还是折衷方案？基于摘要标记的上下文压缩综合研究](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17483) \u003Cbr> 邓晨龙、张志松、毛克隆、李帅毅、黄欣婷、于东、窦志成 |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_f62c7f75f82f.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17483)|[\u002F\u002F]: #12\u002F30\n|[JPPO：联合功率与提示优化以加速大型语言模型服务](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18010) \u003Cbr> 游飞然、杜洪洋、黄凯斌、阿巴斯·贾马利普尔 |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_5f444bdfcf92.png\"> 
|[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18010)|[\u002F\u002F]: #12\u002F07\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkaistai\u002FGenPI.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fkaistai\u002FGenPI)\u003Cbr>[生成式提示内化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.15927) \u003Cbr> 申河彬、季磊、龚叶云、金成东、崔恩菲、徐敏俊 |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_68b9f4975e3d.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fkaistai\u002FGenPI) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.15927)|[\u002F\u002F]: #12\u002F02\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fnoelkelias\u002Fmultitok.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fnoelkelias\u002Fmultitok)\u003Cbr>[MultiTok：基于LZW压缩改进的高效LLM可变长度分词](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.21548) \u003Cbr> 诺埃尔·埃利亚斯、霍玛·埃斯法哈尼扎德、卡安·卡莱、斯里拉姆·维什瓦纳特、穆里埃尔·梅达尔 |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_c6e7649370d9.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fnoelkelias\u002Fmultitok) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.21548)|[\u002F\u002F]: #11\u002F17\n|[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-EMNLP'24%20Findings-blue)]()\u003Cbr>[Selection-p：自监督的任务无关提示压缩，兼顾忠实性和可迁移性](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.11786) \u003Cbr> 钟子婷、崔乐阳、刘乐茂、黄欣婷、史书铭、杨迪彦 |\u003Cimg width=\"202\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_f27312d71db2.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.11786)|[\u002F\u002F]: 
#10\u002F21\n|[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-EMNLP'24%20Findings-blue)]()\u003Cbr>[从阅读到压缩：探索多文档阅读器在提示压缩中的应用](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.04139) \u003Cbr> 崔恩成、李善京、崔珉珍、朴俊、李钟郁 |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_cbf1bd50fb1a.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.04139)|[\u002F\u002F]: #10\u002F14\n|[感知压缩器：一种无需训练的长上下文场景下提示压缩方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.19272) \u003Cbr> 唐继伟、许进、卢廷伟、林海、赵一鸣、郑海涛 |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_c12eb911caed.png\"> |[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.19272)|[\u002F\u002F]: #10\u002F02\n| [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FWorkday\u002Fcpc.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FWorkday\u002Fcpc)![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-AAAI'25-blue)\u003Cbr>[具有上下文感知句子编码的提示压缩，用于快速且改进的LLM推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.01227) \u003Cbr> 巴里斯·利斯卡韦茨、马克西姆·乌沙科夫、舒文杜·罗伊、马克·克利巴诺夫、阿里·埃特马德、谢恩·卢克 |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_8c1ada575b42.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FWorkday\u002Fcpc) \u003Cbr> [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.01227)|[\u002F\u002F]: #12\u002F30\n| [任务无关的提示压缩：结合上下文感知句子嵌入与奖励引导的任务描述符](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13374v1) \u003Cbr> 巴里斯·利斯卡韦茨、舒文杜·罗伊、马克西姆·乌沙科夫、马克·克利巴诺夫、阿里·埃特马德、谢恩·卢克 |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_89e9fe6f2989.png\"> | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13374v1)|[\u002F\u002F]: #12\u002F30\n\n### 低秩分解\n| 
标题及作者 | 简介 | 链接 |\n|:--|  :----: | :---:|\n|[![发布](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F会议-NeurIPS'24-blue)]()\u003Cbr>[ESPACE：用于模型压缩的激活维度约简](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05437) \u003Cbr> Charbel Sakr, Brucek Khailany |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_996e9b05616b.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05437)|[\u002F\u002F]: #10\u002F14\n|[![星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fselfsupervised-ai\u002FNatural-GaLore.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fselfsupervised-ai\u002FNatural-GaLore)\u003Cbr>[Natural GaLore：加速GaLore以实现内存高效的LLM训练和微调](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16029) \u003Cbr> Arijit Das | |[Github](https:\u002F\u002Fgithub.com\u002Fselfsupervised-ai\u002FNatural-GaLore) \u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16029)|[\u002F\u002F]: #10\u002F30\n|[CompAct：用于内存高效LLM训练的压缩激活](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15352) \u003Cbr> Yara Shamshoum, Nitzan Hodos, Yuval Sieradzki, Assaf Schuster |\u003Cimg width=\"202\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_736f3e6ba671.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15352)|[\u002F\u002F]: #10\u002F30\n\n### 硬件\u002F系统\u002F推理服务\n| 标题及作者 | 简介 | 链接 |\n|:--|  :----: | :---:|\n|[KunServe：基于参数中心内存管理的弹性高效大语言模型推理服务](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18169) \u003Cbr> Rongxin Cheng, Yifan Peng, Yuxin Lai, Xingda Wei, Rong Chen, Haibo Chen |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_23cbed4508da.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18169)|[\u002F\u002F]: #12\u002F30\n|[FastSwitch：在公平性驱动的大语言模型推理服务中优化上下文切换效率](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18424) 
\u003Cbr> Ao Shen, Zhiyao Li, Mingyu Gao |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_a048911f1bd0.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18424)|[\u002F\u002F]: #12\u002F07\n|[CE-CoLLM：通过云边协同实现高效自适应大语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02829) \u003Cbr> Hongpeng Jin, Yanzhao Wu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_3e104160d403.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02829)|[\u002F\u002F]: #11\u002F18\n|[Ripple：通过相关性感知的神经元管理加速智能手机上的LLM推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19274) \u003Cbr> Tuowei Wang, Ruwen Fan, Minxing Huang, Zixu Hao, Kun Li, Ting Cao, Youyou Lu, Yaoxue Zhang, Ju Ren |\u003Cimg width=\"302\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_b07073617c87.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19274)|[\u002F\u002F]: #11\u002F17\n|[![发布](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F会议-ICCAD'24-blue)]()\u003Cbr>[ALISE：通过推测调度加速大语言模型推理服务](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.23537) \u003Cbr> Youpeng Zhao, Jun Wang |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_b563852ec107.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.23537)|[\u002F\u002F]: #11\u002F17\n|[EPIC：用于大语言模型推理服务的高效位置无关上下文缓存](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15332) \u003Cbr> Junhao Hu, Wenrui Huang, Haoyi Wang, Weidong Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie |\u003Cimg width=\"202\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_2c564245970e.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15332)|[\u002F\u002F]: 
#10\u002F30\n|[![发布](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F会议-NeurIPS'24-blue)]()\u003Cbr>[SDP4Bit：迈向LLM训练中分片数据并行的4位通信量化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15526) \u003Cbr> Jinda Jia, Cong Xie, Hanlin Lu, Daoce Wang, Hao Feng, Chengming Zhang, Baixi Sun, Haibin Lin, Zhi Zhang, Xin Liu, Dingwen Tao |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_224a6f45ced6.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15526)|[\u002F\u002F]: #10\u002F30\n|[FastAttention：将FlashAttention2扩展到NPU和低资源GPU](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16663) \u003Cbr> Haoran Lin, Xianzhi Yu, Kang Zhao, Lu Hou, Zongyuan Zhan等 |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_b9382bd22920.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16663)|[\u002F\u002F]: #10\u002F29\n|[POD-Attention：解锁完整的预填充-解码重叠以加速LLM推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18038) \u003Cbr> Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, Ashish Panwar |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_1bd4a09dba02.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18038)|[\u002F\u002F]: #10\u002F29\n|[![星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLizonghang\u002FTPI-LLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FLizonghang\u002FTPI-LLM)\u003Cbr>[TPI-LLM：在低资源边缘设备上高效服务70B规模的LLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.00531) \u003Cbr> Zonghang Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_2238a97214ad.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FLizonghang\u002FTPI-LLM) 
\u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.00531)|[\u002F\u002F]: #10\u002F02\n\n### 高效微调\n| 标题与作者 | 简介 | 链接 |\n|:--|  :----: | :---:|\n|[![发表](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-ACL'25%20Findings-blue)]()\u003Cbr>[LoRMA：用于大语言模型的低秩乘法适配](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.07621) \u003Cbr> Harsh Bihany, Shubham Patel, Ashutosh Modi |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_e69e728376d1.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.07621)|[\u002F\u002F]: #06\u002F16\n|[HELENE：利用二阶优化加速大语言模型微调的海森矩阵分层裁剪与梯度退火](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10696) \u003Cbr> Huaqin Zhao, Jiaxi Li, Yi Pan, Shizhe Liang, Xiaofeng Yang, Wei Liu, Xiang Li, Fei Dou, Tianming Liu, Jin Lu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_bb3c4259083d.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10696)|[\u002F\u002F]: #11\u002F24\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLCS2-IIITD\u002FMonteCLoRA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FLCS2-IIITD\u002FMonteCLoRA)\u003Cbr>[基于贝叶斯重参数化的低秩适配实现鲁棒高效的大语言模型微调](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04358) \u003Cbr> Ayan Sengupta, Vaibhav Seth, Arinjay Pathak, Natraj Raman, Sriram Gopalakrishnan, Tanmoy Chakraborty |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_6233fbed144e.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FLCS2-IIITD\u002FMonteCLoRA) \u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04358)|[\u002F\u002F]: 
#11\u002F18\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fselfsupervised-ai\u002FNatural-GaLore.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fselfsupervised-ai\u002FNatural-GaLore)\u003Cbr>[Natural GaLore：加速GaLore以实现内存高效的大语言模型训练和微调](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16029) \u003Cbr> Arijit Das | |[Github](https:\u002F\u002Fgithub.com\u002Fselfsupervised-ai\u002FNatural-GaLore) \u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16029)|[\u002F\u002F]: #10\u002F30\n|[少即是多：用于高效微调大语言模型的极端梯度提升秩1适配](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19694) \u003Cbr> Yifei Zhang, Hao Zhu, Aiwei Liu, Han Yu, Piotr Koniusz, Irwin King |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_88d7e76e9be1.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19694)|[\u002F\u002F]: #11\u002F18\n|[![发表](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-EMNLP'24%20Findings-blue)]()\u003Cbr>[MiLoRA：用于大语言模型微调的高效低秩适配混合方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18035) \u003Cbr> Jingfan Zhang, Yi Zhao, Dan Chen, Xing Tian, Huanran Zheng, Wei Zhu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_b2ce406bf6f1.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18035)|[\u002F\u002F]: #10\u002F29\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FKowsher\u002FRoCoFT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FKowsher\u002FRoCoFT)\u003Cbr>[RoCoFT：通过行-列更新实现大语言模型的高效微调](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10075) \u003Cbr> Md Kowsher, Tara Esmaeilbeig, Chun-Nam Yu, Mojtaba Soltanalian, Niloofar Yousefi |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_45f0a6ddca9c.png\"> 
|[Github](https:\u002F\u002Fgithub.com\u002FKowsher\u002FRoCoFT) \u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10075)|[\u002F\u002F]: #10\u002F21\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FKaiseem\u002FIST.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FKaiseem\u002FIST)[![发表](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-EMNLP'24-blue)]()\u003Cbr>[层间重要性至关重要：在大语言模型的参数高效微调中用更少的内存获得更好的性能](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.11772) \u003Cbr> Kai Yao, Penlei Gao, Lichun Li, Yuan Zhao, Xiaofeng Wang, Wei Wang, Jianke Zhu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_2f0b0fd5a5bd.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FKaiseem\u002FIST) \u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.11772)|[\u002F\u002F]: #10\u002F21\n|[![发表](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-Nature%20Scientific%20Reports-blue)]()\u003Cbr>[利用语义知识调优进行大语言模型的参数高效微调](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.08598) \u003Cbr> Nusrat Jahan Prottasha, Asif Mahmud, Md. 
Shohanur Islam Sobuj, Prakash Bhat, Md Kowsher, Niloofar Yousefi, Ozlem Ozmen Garibay |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_ebf0219fc8fe.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.08598)|[\u002F\u002F]: #10\u002F21\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fxvyaward\u002Fqeft.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fxvyaward\u002Fqeft)[![发表](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-EMNLP'24%20Findings-blue)]()\u003Cbr>[QEFT：用于高效微调大语言模型的量化方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.08661) \u003Cbr> Changhun Lee, Jun-gyu Jin, Younghyun Cho, Eunhyeok Park |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_9e61a9585690.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fxvyaward\u002Fqeft) \u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.08661)|[\u002F\u002F]: #10\u002F21\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAofei-Chang\u002FBIPEFT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAofei-Chang\u002FBIPEFT)[![发表](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-EMNLP'24%20Findings-blue)]()\u003Cbr>[BIPEFT：基于预算指导的迭代搜索，用于大型预训练语言模型的参数高效微调](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.09079) \u003Cbr> Aofei Chang, Jiaqi Wang, Han Liu, Parminder Bhatia, Cao Xiao, Ting Wang, Fenglong Ma |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_b49f88abca19.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FAofei-Chang\u002FBIPEFT) \u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.09079)|[\u002F\u002F]: 
#10\u002F21\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsayankotor\u002Fsparse_grads.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fsayankotor\u002Fsparse_grads)\u003Cbr>[SparseGrad：一种选择性方法，用于高效微调MLP层](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.07383) \u003Cbr> Viktoriia Chekalina, Anna Rudenko, Gleb Mezentsev, Alexander Mikhalev, Alexander Panchenko, Ivan Oseledets |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_072c7242cd1b.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fsayankotor\u002Fsparse_grads) \u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.07383)|[\u002F\u002F]: #10\u002F13\n|[SpaLLM：利用草图技术实现大语言模型的统一压缩适配](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.06364) \u003Cbr> Tianyi Zhang, Junda Su, Oscar Wu, Zhaozhuo Xu, Anshumali Shrivastava |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_465c6a247982.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.06364)|[\u002F\u002F]: #10\u002F13\n\n### 高效训练\n| 标题与作者 | 简介 | 链接 |\n|:--|  :----: | :---:|\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fneiterman21\u002FLDB.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fneiterman21\u002FLDB)\u003Cbr>[LayerDropBack：一种通用的加速深度网络训练方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18027) \u003Cbr> Evgeny Hershkovitch Neiterman, Gil Ben-Artzi |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_3a004c1b9def.png\"> |[Github](https:\u002F\u002Fgithub.com\u002Fneiterman21\u002FLDB) \u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18027)|[\u002F\u002F]: #12\u002F30\n|[AutoMixQ：用于高性能、内存高效的微调的自适应量化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.13814) \u003Cbr> Changhai Zhou, Shiyang Zhang, Yuhua Zhou, 
Zekai Liu, Shichao Weng |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_6228f3424bfd.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.13814)|[\u002F\u002F]: #11\u002F24\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTsinghuaC3I\u002FLPA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FLPA)[![Publish](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FConference-EMNLP'24-blue)]()\u003Cbr>[基于低维投影注意力的大规模语言模型可扩展高效训练](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02063) \u003Cbr> Xingtai Lv, Ning Ding, Kaiyan Zhang, Ermo Hua, Ganqu Cui, Bowen Zhou |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_3e669dc89fb9.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FLPA) \u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02063)|[\u002F\u002F]: #11\u002F18\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FCOAT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FCOAT)\u003Cbr>[COAT：用于内存高效FP8训练的优化器状态与激活压缩](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19313) \u003Cbr> Haocheng Xi, Han Cai, Ligeng Zhu, Yao Lu, Kurt Keutzer, Jianfei Chen, Song Han |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_26370ac3c601.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FCOAT) \u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19313)|[\u002F\u002F]: #11\u002F17\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwuhouming\u002FBitPipe.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fwuhouming\u002FBitPipe)\u003Cbr>[BitPipe：用于加速大模型训练的双向交错流水线并行](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19367) \u003Cbr> Houming Wu, Ling 
Chen, Wenjie Yu |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fwuhouming\u002FBitPipe\u002Fraw\u002Fmain\u002Fdocs\u002FBitPipe_images\u002FBitPipe-v.svg\"> |[Github](https:\u002F\u002Fgithub.com\u002Fwuhouming\u002FBitPipe) \u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19367)|[\u002F\u002F]: #11\u002F17\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fselfsupervised-ai\u002FNatural-GaLore.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fselfsupervised-ai\u002FNatural-GaLore)\u003Cbr>[Natural GaLore：加速GaLore以实现内存高效的LLM训练和微调](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16029) \u003Cbr> Arijit Das | |[Github](https:\u002F\u002Fgithub.com\u002Fselfsupervised-ai\u002FNatural-GaLore) \u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16029)|[\u002F\u002F]: #10\u002F30\n|[CompAct：用于内存高效LLM训练的压缩激活](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15352) \u003Cbr> Yara Shamshoum, Nitzan Hodos, Yuval Sieradzki, Assaf Schuster |\u003Cimg width=\"202\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_736f3e6ba671.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15352)|[\u002F\u002F]: #10\u002F30\n\n\n\n### 综述（或基准测试）\n| 标题与作者 | 简介 | 链接 |\n|:--|  :----: | :---:|\n|[深入探讨高效推理方法：推测解码综述](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.13157) \u003Cbr> Hyun Ryu, Eric Kim |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_75d60a983a59.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.13157)|[\u002F\u002F]: 
#11\u002F24\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fargonne-lcf\u002FLLM-Inference-Bench.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fargonne-lcf\u002FLLM-Inference-Bench)\u003Cbr>[LLM-Inference-Bench：AI加速器上大型语言模型的推理基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.00136) \u003Cbr> Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus等 | |[Github](https:\u002F\u002Fgithub.com\u002Fargonne-lcf\u002FLLM-Inference-Bench) \u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.00136)|[\u002F\u002F]: #11\u002F18\n|[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZongqianLi\u002FPrompt-Compression-Survey.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FZongqianLi\u002FPrompt-Compression-Survey)\u003Cbr>[大型语言模型中的提示压缩：综述](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12388) \u003Cbr> Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_f6515b4dff5d.png\"> |[Github](https:\u002F\u002Fgithub.com\u002FZongqianLi\u002FPrompt-Compression-Survey) \u003Cbr> [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12388)|[\u002F\u002F]: #10\u002F21\n|[大型语言模型推理加速：全面的硬件视角](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.04466) \u003Cbr> Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai |\u003Cimg width=\"1002\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_readme_779b11575682.png\"> |[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.04466)|[\u002F\u002F]: #10\u002F14","# Awesome-Efficient-LLM 快速上手指南\n\n`Awesome-Efficient-LLM` 并非一个单一的 Python 库或可执行软件，而是一个**精选资源列表（Curated List）**，汇集了关于高效大语言模型（Efficient LLM）的前沿论文、开源代码和项目。它涵盖了网络剪枝、知识蒸馏、量化、推理加速等关键领域。\n\n本指南将介绍如何利用该列表快速找到合适的工具并运行相关代码。\n\n## 1. 
环境准备\n\n由于列表中包含多个独立的开源项目，环境需求取决于你具体选择的研究方向（如剪枝、量化或蒸馏）。以下是通用的基础环境建议：\n\n*   **操作系统**: Linux (推荐 Ubuntu 20.04\u002F22.04) 或 macOS。\n*   **Python 版本**: 建议 `Python 3.9` 或更高版本。\n*   **核心依赖**:\n    *   `PyTorch`: 大多数项目基于 PyTorch (建议 2.0+)。\n    *   `Transformers`: Hugging Face `transformers` 库。\n    *   `CUDA`: 若需 GPU 加速，请确保安装匹配的 NVIDIA 驱动和 CUDA Toolkit。\n*   **工具**: `git`, `pip` 或 `conda`。\n\n> **提示**: 在访问 GitHub 仓库或下载论文时，国内开发者可使用 **Gitee 镜像**（如果项目有同步）或配置 **GitHub 加速代理** 以提升克隆速度。\n\n## 2. 安装与获取步骤\n\n由于这是一个资源索引，不存在统一的 `pip install awesome-efficient-llm` 命令。请使用以下步骤获取目标项目的代码：\n\n### 第一步：浏览与选择\n访问 [Awesome-Efficient-LLM GitHub 主页](https:\u002F\u002Fgithub.com\u002Fhorseee\u002FAwesome-Efficient-LLM)，根据需求点击对应的子分类（例如 `Network Pruning \u002F Sparsity` 或 `Quantization`），找到带有 :star: 标记的推荐论文及其 GitHub 链接。\n\n### 第二步：克隆目标项目\n假设你选择了列表中的热门项目 **SparseGPT**（用于模型剪枝），请在终端执行：\n\n```bash\n# 使用 git 克隆选定的项目仓库\ngit clone https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fsparsegpt.git\n\n# 进入项目目录\ncd sparsegpt\n```\n\n*(注：若直接克隆速度慢，可尝试使用国内镜像源地址，或在 URL 前添加加速前缀)*\n\n### 第三步：安装项目依赖\n每个子项目都有独立的 `requirements.txt`。以 SparseGPT 为例：\n\n```bash\n# 创建虚拟环境（推荐）\npython -m venv venv\nsource venv\u002Fbin\u002Factivate  # Windows 用户请使用: venv\\Scripts\\activate\n\n# 安装依赖\npip install -r requirements.txt\n```\n\n## 3. 基本使用示例\n\n不同子项目的使用方式各异。以下以列表中推荐的 **SparseGPT** 为例，展示如何对一个大语言模型进行一次性剪枝。\n\n### 场景：对 LLaMA 模型进行 2:4 半结构化稀疏剪枝\n\n1.  **准备模型**: 确保你有 Hugging Face 格式的模型权重，或能自动下载。\n2.  **运行剪枝脚本**:\n\n```bash\n# 基本用法示例（以下参数名仅为示意，实际脚本参数请以 SparseGPT 仓库的 README 为准）\n# --model: 模型路径或名称\n# --prune_n \u002F --prune_w: 2:4 半结构化稀疏，即每 4 个权重剪去 2 个（50% 剪枝）\n# --save: 保存剪枝后模型的路径\n\npython main.py \\\n    --model meta-llama\u002FLlama-2-7b-hf \\\n    --prune_n 2 \\\n    --prune_w 4 \\\n    --save .\u002Fpruned_llama_50\n```\n\n### 其他方向快速指引\n*   **量化 (Quantization)**: 查找列表中的 `quantization.md`，通常涉及 `bitsandbytes` 或 `AWQ` 等库，使用类似 `python quantize.py --model ... --wbits 4` 的命令。\n*   **知识蒸馏 (Knowledge Distillation)**: 查找 `knowledge_distillation.md`，通常需要指定教师模型和学生模型进行训练。\n\n## 4. 
贡献与更新\n如果你有自己的高效 LLM 论文或项目希望收录：\n1.  在项目根目录的 `generate_item.py` 中填写论文或项目信息。\n2.  运行 `python generate_item.py` 生成对应的 Markdown 条目。\n3.  提交 Pull Request 至原仓库。\n\n> **注意**: 该列表主页仅展示近 90 天内收录的最新论文，完整历史论文请查阅各子目录文件（如 `pruning.md`）。","某初创团队试图将开源的 70B 参数大语言模型部署到仅有单张消费级显卡的边缘服务器上，以构建低成本的智能客服系统。\n\n### 没有 Awesome-Efficient-LLM 时\n- **选型迷茫**：面对海量的模型压缩论文（如剪枝、量化、蒸馏），团队花费数周盲目试错，难以判断哪些技术适合当前硬件。\n- **显存爆炸**：直接加载全精度模型导致显存瞬间溢出，无法进行任何推理，且缺乏针对 KV Cache 压缩的有效方案。\n- **推理迟缓**：勉强运行的模型响应延迟高达数秒，完全无法满足客服场景对实时互动的要求。\n- **微调成本过高**：尝试全量微调时，因缺乏高效微调（Efficient Fine-tuning）指导，训练资源消耗超出预算十倍。\n\n### 使用 Awesome-Efficient-LLM 后\n- **精准导航**：团队通过“Quantization”和“Hardware\u002FServing”分类，迅速锁定了适合消费级显卡的 4-bit 量化方案及推理加速框架。\n- **显存优化**：参考“KV Cache Compression”板块的最新论文，成功将长上下文对话的显存占用降低了 60%，模型顺利加载。\n- **极速响应**：应用“Inference Acceleration”列表中的优化算子，将首字生成延迟从 3 秒压缩至 200 毫秒，体验流畅自然。\n- **低成本适配**：利用“Efficient Fine-tuning”推荐的 LoRA 等策略，仅用少量数据即可完成领域适配，训练成本降低 90%。\n\nAwesome-Efficient-LLM 充当了高效大模型落地的“技术雷达”，帮助开发者在有限的算力约束下，以最低成本实现高性能的模型部署与应用。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhorseee_Awesome-Efficient-LLM_384da68b.png","horseee","Ma Xinyin","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fhorseee_52d6b46b.jpg","Ph.D. Candidate\r\n @ NUS XML Lab🤔","National University of Singapore","Singapore","maxinyin@u.nus.edu","horseeeMa","horseee.github.io","https:\u002F\u002Fgithub.com\u002Fhorseee",[83],{"name":84,"color":85,"percentage":86},"Python","#3572A5",100,1978,157,"2026-04-05T05:21:27",null,1,"","未说明",{"notes":95,"python":93,"dependencies":96},"该仓库是一个精选列表（Awesome List），汇集了关于高效大语言模型（LLM）的论文、项目和资源，本身不是一个可直接运行的软件工具。因此，README 中未包含具体的操作系统、GPU、内存、Python 版本或依赖库等运行环境需求。用户需根据列表中链接到的具体子项目（如 SparseGPT, LLM-Pruner, Wanda 等）的独立文档来查询相应的环境配置要求。",[],[35,14],[99,100,101,102,103,104,105,106],"compression","knowledge-distillation","language-model","llm","model-quantization","pruning-algorithms","efficient-llm","llm-compression","2026-03-27T02:49:30.150509","2026-04-06T20:03:42.241496",[],[]]