[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-xlite-dev--Awesome-LLM-Inference":3,"tool-xlite-dev--Awesome-LLM-Inference":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
- **ragflow** (`infiniflow/ragflow`, 77,062 stars): A leading open-source retrieval-augmented generation (RAG) engine that builds a more accurate, reliable context layer for LLMs and combines RAG with agent capabilities. Deep parsing of complex document structure (tables, charts, mixed layouts) improves retrieval accuracy and reduces hallucination, while a visual workflow editor and flexible APIs serve both non-experts and developers; the project is open source under Apache 2.0.

## Awesome-LLM-Inference (`xlite-dev/Awesome-LLM-Inference`)

📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉

Awesome-LLM-Inference is a curated resource library for inference optimization of large language models (LLMs) and vision-language models (VLMs). It systematically collects frontier papers that ship with code, spanning FlashAttention, PagedAttention, WINT8/4 quantization, multi-GPU parallelism and related areas. By grouping the latest results by topic, it helps developers quickly locate solutions to high inference latency, large memory footprints and long-context handling, whether they want to understand the underlying principles or find engineering optimizations they can apply directly. The list targets AI engineers, algorithm researchers and teams that care about deployment performance. Its distinguishing features are the consistent "paper + code" pairing, continuously updated hot topics such as the DeepSeek/MLA architecture, disaggregated prefill and decoding, and hybrid parallelism, and a 500-page survey for beginners that lowers the entry barrier, making it an efficient bridge between academic innovation and industrial deployment.

---

<div align='center'>
  <img src=https://github.com/user-attachments/assets/fcd83ff2-7ace-4fb5-8d3b-3ccbc1ecbf87 width=250px >
</div>
<div align='center'>
  <img src=https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg >
  <img src=https://img.shields.io/github/downloads/xlite-dev/Awesome-LLM-Inference/total?color=ccf&label=downloads&logo=github&logoColor=lightgrey >
  <img src=https://img.shields.io/github/stars/xlite-dev/Awesome-LLM-Inference.svg?style=social >
  <img src=https://img.shields.io/badge/Release-v2.6-brightgreen.svg >
  <img src=https://img.shields.io/badge/License-GPLv3.0-turquoise.svg >
</div>

## 📒Introduction
Awesome-LLM-Inference: A curated list of [📙Awesome LLM Inference Papers with Codes](#paperlist). For Awesome Diffusion Inference, please check 📖[Awesome-DiT-Inference](https://github.com/xlite-dev/Awesome-DiT-Inference) ![](https://img.shields.io/github/stars/xlite-dev/Awesome-DiT-Inference.svg?style=social). For CUDA learn notes, please check 📖[LeetCUDA](https://github.com/xlite-dev/LeetCUDA) ![](https://img.shields.io/github/stars/xlite-dev/LeetCUDA.svg?style=social).

## 📖 News 🔥🔥
<div id="news"></div>

- [2026/03] The Cache-DiT **[🎉v1.3.0](https://github.com/vipshop/cache-dit)** release is ready. The major updates include [Ring](https://cache-dit.readthedocs.io/en/latest/user_guide/CONTEXT_PARALLEL) Attention w/ [batched P2P](https://cache-dit.readthedocs.io/en/latest/user_guide/CONTEXT_PARALLEL), [USP](https://cache-dit.readthedocs.io/en/latest/user_guide/CONTEXT_PARALLEL/) (Hybrid Ring and Ulysses), Hybrid 2D and 3D Parallelism (💥[USP + TP](https://cache-dit.readthedocs.io/en/latest/user_guide/HYBRID_PARALLEL/)), and reduced VAE-P communication overhead.

![arch](https://oss.gittoolsai.com/images/xlite-dev_Awesome-LLM-Inference_readme_de7e08501500.png)

## ©️Citations

```BibTeX
@misc{Awesome-LLM-Inference@2024,
  title={Awesome-LLM-Inference: A curated list of Awesome LLM Inference Papers with codes},
  url={https://github.com/xlite-dev/Awesome-LLM-Inference},
  note={Open-source software available at https://github.com/xlite-dev/Awesome-LLM-Inference},
  author={xlite-dev, liyucheng09 etc},
  year={2024}
}
```

## 🎉Awesome LLM Inference Papers with Codes

[Awesome LLM Inference for Beginners.pdf](https://github.com/xlite-dev/Awesome-LLM-Inference/releases/download/v0.3/Awesome-LLM-Inference-v0.3.pdf.zip): 500 pages, FastServe, FlashAttention 1/2, FlexGen, FP8, LLM.int8(), PagedAttention, RoPE, SmoothQuant, WINT8/4, Continuous Batching, ZeroQuant 1/2/FP, AWQ etc.

<div align='center'>
<img src=https://github.com/xlite-dev/Awesome-LLM-Inference/assets/31974251/0ed77e9d-a1eb-4095-9a82-bad624964e55 >
</div>

## 🎉Download All PDFs
```bash
python3 download_pdfs.py # The code is generated by Doubao AI
```
<img width="1267" alt="image" src="https://oss.gittoolsai.com/images/xlite-dev_Awesome-LLM-Inference_readme_6775f92f179b.png" />
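The repository's actual `download_pdfs.py` is not reproduced here. Purely as an illustration of the kind of helper this step relies on, a minimal sketch that collects the arXiv PDF links from the README tables and fetches them could look like the following; every name and behavior in it is an assumption, not the real script:

```python
# Hypothetical sketch, NOT the repository's download_pdfs.py:
# collect arXiv PDF links from the README tables and download each one.
import pathlib
import re
import urllib.request

README = pathlib.Path("README.md")
OUT = pathlib.Path("pdfs")
OUT.mkdir(exist_ok=True)

# Matches links such as https://arxiv.org/pdf/2309.06180.pdf or .../2405.04434;
# non-arXiv links (tridao.me, usenix.org, GitHub blobs) would need extra handling.
ARXIV_PDF = re.compile(r"https://arxiv\.org/pdf/\d{4}\.\d{4,5}(?:v\d+)?(?:\.pdf)?")

for url in sorted(set(ARXIV_PDF.findall(README.read_text(encoding="utf-8")))):
    paper_id = url.rsplit("/", 1)[-1].removesuffix(".pdf")
    target = OUT / f"{paper_id}.pdf"
    if target.exists():
        continue  # skip PDFs that were already fetched
    print(f"fetching {url} -> {target}")
    urllib.request.urlretrieve(url, target)
```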
<div id="paperlist"></div>

## 📖Contents
* 📖[Trending LLM/VLM Topics](#Trending-LLM-VLM-Topics)🔥🔥🔥
* 📖[DeepSeek/MLA Topics](#mla)🔥🔥🔥
* 📖[Multi-GPUs/Multi-Nodes Parallelism](#DP-MP-PP-TP-SP-CP)🔥🔥🔥
* 📖[Disaggregating Prefill and Decoding](#P-D-Disaggregating)🔥🔥🔥
* 📖[LLM Algorithmic/Eval Survey](#LLM-Algorithmic-Eval-Survey)
* 📖[LLM Train/Inference Framework/Design](#LLM-Train-Inference-Framework)
* 📖[Weight/Activation Quantize/Compress](#Weight-Activation-Quantize-Compress)🔥
* 📖[Continuous/In-flight Batching](#Continuous-In-flight-Batching)
* 📖[IO/FLOPs-Aware/Sparse Attention](#IO-FLOPs-Aware-Attention-Sparse)🔥
* 📖[KV Cache Scheduling/Quantize/Dropping](#KV-Cache-Scheduling-Quantize-Dropping)🔥
* 📖[Prompt/Context Compression](#Context-Compression)🔥
* 📖[Long Context Attention/KV Cache Optimization](#Long-Context-Attention-KVCache)🔥🔥
* 📖[Early-Exit/Intermediate Layer Decoding](#Early-Exit)
* 📖[Parallel Decoding/Sampling](#Parallel-Decoding-Sampling)🔥
* 📖[Structured Prune/KD/Weight Sparse](#Structured_Pruning_KD_Weight_Sparse)
* 📖[Mixture-of-Experts(MoE) LLM Inference](#Mixture_of_Experts_LLM_Inference)🔥
* 📖[CPU/NPU/FPGA/Mobile Inference](#CPU-Single-GPU-Inference)
* 📖[Non Transformer Architecture](#Non-Transformer-Architecture)🔥
* 📖[GEMM/Tensor Cores/WMMA/Parallel](#GEMM-Tensor-Cores-WMMA)
* 📖[VLM/Position Embed/Others](#Others)

### 📖Trending LLM/VLM Topics ([©️back👆🏻](#paperlist))
<div id="Trending-LLM-VLM-Topics"></div>

|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2024.04|🔥🔥🔥[Open-Sora] Open-Sora: Democratizing Efficient Video Production for All(@hpcaitech)|[[docs]](https://github.com/hpcaitech/Open-Sora/blob/main/docs/zh_CN/README.md)|[[Open-Sora]](https://github.com/hpcaitech/Open-Sora) ![](https://img.shields.io/github/stars/hpcaitech/Open-Sora.svg?style=social)|⭐️⭐️ |
|2024.04|🔥🔥🔥[Open-Sora Plan] Open-Sora Plan: This project aims to reproduce Sora (OpenAI's T2V model)(@PKU)|[[report]](https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/main/docs/Report-v1.0.0.md)|[[Open-Sora-Plan]](https://github.com/PKU-YuanGroup/Open-Sora-Plan) ![](https://img.shields.io/github/stars/PKU-YuanGroup/Open-Sora-Plan.svg?style=social)|⭐️⭐️ |
|2024.05|🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model(@DeepSeek-AI)|[[pdf]](https://arxiv.org/pdf/2405.04434)|[[DeepSeek-V2]](https://github.com/deepseek-ai/DeepSeek-V2) ![](https://img.shields.io/github/stars/deepseek-ai/DeepSeek-V2.svg?style=social)|⭐️⭐️ |
|2024.05|🔥🔥[YOCO] You Only Cache Once: Decoder-Decoder Architectures for Language Models(@Microsoft)|[[pdf]](https://arxiv.org/pdf/2405.05254)|[[unilm-YOCO]](https://github.com/microsoft/unilm/tree/master/YOCO) ![](https://img.shields.io/github/stars/microsoft/unilm.svg?style=social)|⭐️⭐️ |
|2024.06|🔥[**Mooncake**] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving(@Moonshot AI)|[[pdf]](https://github.com/kvcache-ai/Mooncake/blob/main/Mooncake-v3.pdf)|[[Mooncake]](https://github.com/kvcache-ai/Mooncake) ![](https://img.shields.io/github/stars/kvcache-ai/Mooncake.svg?style=social)|⭐️⭐️ |
|2024.07|🔥🔥[**FlashAttention-3**] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision(@TriDao etc)|[[pdf]](https://tridao.me/publications/flash3/flash3.pdf)|[[flash-attention]](https://github.com/Dao-AILab/flash-attention) ![](https://img.shields.io/github/stars/Dao-AILab/flash-attention.svg?style=social)|⭐️⭐️ |
|2024.07|🔥🔥[**MInference 1.0**] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention(@Microsoft)|[[pdf]](https://arxiv.org/pdf/2407.02490)|[[MInference 1.0]](https://github.com/microsoft/MInference) ![](https://img.shields.io/github/stars/microsoft/MInference.svg?style=social)|⭐️⭐️ |
|2024.11|🔥🔥🔥[**Star-Attention: 11x~ speedup**] Star Attention: Efficient LLM Inference over Long Sequences(@NVIDIA)|[[pdf]](https://arxiv.org/pdf/2411.17116)|[[Star-Attention]](https://github.com/NVIDIA/Star-Attention) ![](https://img.shields.io/github/stars/NVIDIA/Star-Attention.svg?style=social)|⭐️⭐️ |
|2024.12|🔥🔥🔥[**DeepSeek-V3**] DeepSeek-V3 Technical Report(@deepseek-ai)|[[pdf]](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf)|[[DeepSeek-V3]](https://github.com/deepseek-ai/DeepSeek-V3) ![](https://img.shields.io/github/stars/deepseek-ai/DeepSeek-V3.svg?style=social)|⭐️⭐️ |
|2025.01|🔥🔥🔥[**MiniMax-Text-01**] MiniMax-01: Scaling Foundation Models with Lightning Attention|[[report]](https://filecdn.minimax.chat/_Arxiv_MiniMax_01_Report.pdf)|[[MiniMax-01]](https://github.com/MiniMax-AI/MiniMax-01) ![](https://img.shields.io/github/stars/MiniMax-AI/MiniMax-01.svg?style=social)|⭐️⭐️ |
|2025.01|🔥🔥🔥[**DeepSeek-R1**] DeepSeek-R1 Technical Report(@deepseek-ai)|[[pdf]](https://arxiv.org/pdf/2501.12948v1)|[[DeepSeek-R1]](https://github.com/deepseek-ai/DeepSeek-R1) ![](https://img.shields.io/github/stars/deepseek-ai/DeepSeek-R1.svg?style=social)|⭐️⭐️ |
### 📖DeepSeek/Multi-head Latent Attention(MLA) ([©️back👆🏻](#paperlist))
<div id="mla"></div>

|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2024.05|🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model(@DeepSeek-AI)|[[pdf]](https://arxiv.org/pdf/2405.04434)|[[DeepSeek-V2]](https://github.com/deepseek-ai/DeepSeek-V2) ![](https://img.shields.io/github/stars/deepseek-ai/DeepSeek-V2.svg?style=social)|⭐️⭐️ |
|2024.12|🔥🔥🔥[**DeepSeek-V3**] DeepSeek-V3 Technical Report(@deepseek-ai)|[[pdf]](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf)|[[DeepSeek-V3]](https://github.com/deepseek-ai/DeepSeek-V3) ![](https://img.shields.io/github/stars/deepseek-ai/DeepSeek-V3.svg?style=social)|⭐️⭐️ |
|2025.01|🔥🔥🔥[**DeepSeek-R1**] DeepSeek-R1 Technical Report(@deepseek-ai)|[[pdf]](https://arxiv.org/pdf/2501.12948v1)|[[DeepSeek-R1]](https://github.com/deepseek-ai/DeepSeek-R1) ![](https://img.shields.io/github/stars/deepseek-ai/DeepSeek-R1.svg?style=social)|⭐️⭐️ |
|2025.02|🔥🔥🔥[**DeepSeek-NSA**] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention(@deepseek-ai)|[[pdf]](https://arxiv.org/pdf/2502.11089)|⚠️|⭐️⭐️ |
|2025.02|🔥🔥🔥[**FlashMLA**] DeepSeek FlashMLA(@deepseek-ai)|⚠️|[[FlashMLA]](https://github.com/deepseek-ai/FlashMLA) ![](https://img.shields.io/github/stars/deepseek-ai/FlashMLA.svg?style=social)|⭐️⭐️ |
|2025.02|🔥🔥🔥[**DualPipe**] DeepSeek DualPipe(@deepseek-ai)|⚠️|[[DualPipe]](https://github.com/deepseek-ai/DualPipe) ![](https://img.shields.io/github/stars/deepseek-ai/DualPipe.svg?style=social)|⭐️⭐️ |
|2025.02|🔥🔥🔥[**DeepEP**] DeepSeek DeepEP(@deepseek-ai)|⚠️|[[DeepEP]](https://github.com/deepseek-ai/DeepEP) ![](https://img.shields.io/github/stars/deepseek-ai/DeepEP.svg?style=social)|⭐️⭐️ |
|2025.02|🔥🔥🔥[**DeepGEMM**] DeepSeek DeepGEMM(@deepseek-ai)|⚠️|[[DeepGEMM]](https://github.com/deepseek-ai/DeepGEMM) ![](https://img.shields.io/github/stars/deepseek-ai/DeepGEMM.svg?style=social)|⭐️⭐️ |
|2025.02|🔥🔥🔥[**EPLB**] DeepSeek EPLB(@deepseek-ai)|⚠️|[[EPLB]](https://github.com/deepseek-ai/EPLB) ![](https://img.shields.io/github/stars/deepseek-ai/EPLB.svg?style=social)|⭐️⭐️ |
|2025.02|🔥🔥🔥[**3FS**] DeepSeek 3FS(@deepseek-ai)|⚠️|[[3FS]](https://github.com/deepseek-ai/3FS) ![](https://img.shields.io/github/stars/deepseek-ai/3FS.svg?style=social)|⭐️⭐️ |
|2025.03|🔥🔥🔥[**Inference System**] DeepSeek-V3 / R1 Inference System Overview (@deepseek-ai)|[[blog]](https://zhuanlan.zhihu.com/p/27181462601)|⚠️|⭐️⭐️ |
|2025.02|🔥🔥[**MHA2MLA**] Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs(@fudan.edu.cn)|[[pdf]](https://arxiv.org/pdf/2502.14837)|[[MHA2MLA]](https://github.com/JT-Ushio/MHA2MLA) ![](https://img.shields.io/github/stars/JT-Ushio/MHA2MLA.svg?style=social)|⭐️⭐️ |
|2025.02|🔥🔥[**TransMLA**] TransMLA: Multi-head Latent Attention Is All You Need(@PKU)|[[pdf]](https://arxiv.org/pdf/2502.07864)|[[TransMLA]](https://github.com/fxmeng/TransMLA) ![](https://img.shields.io/github/stars/fxmeng/TransMLA.svg?style=social)|⭐️⭐️ |
|2025.03|🔥🔥[**X-EcoMLA**] X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression(@AMD)|[[pdf]](https://arxiv.org/pdf/2503.11132)|⚠️|⭐️⭐️ |
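For readers new to this topic, the common idea behind several entries above is that MLA-style attention caches a small low-rank latent per token instead of full keys and values, which are reconstructed from the latent at attention time. A toy NumPy sketch of that compress-then-expand step follows; the shapes and names are illustrative only and are not DeepSeek's implementation:

```python
# Toy illustration of the latent KV-cache idea behind MLA-style attention.
# Not DeepSeek's implementation; dimensions are made up for readability.
import numpy as np

d_model, d_latent, n_heads, d_head = 1024, 128, 8, 64
rng = np.random.default_rng(0)

W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)            # compress hidden state
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)  # expand latent to K
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)  # expand latent to V

def step(h_t, latent_cache):
    """Append one token: cache only the d_latent vector, not full K/V."""
    c_t = h_t @ W_down                      # (d_latent,) -- this is all that gets stored
    latent_cache.append(c_t)
    C = np.stack(latent_cache)              # (seq_len, d_latent)
    K = C @ W_up_k                          # (seq_len, n_heads * d_head), rebuilt on the fly
    V = C @ W_up_v
    return K, V

cache = []
for _ in range(16):                         # 16 decode steps
    K, V = step(rng.standard_normal(d_model), cache)

full_kv = 16 * 2 * n_heads * d_head         # floats a plain multi-head KV cache would hold
print(f"cached floats: {16 * d_latent} (latent) vs {full_kv} (standard KV)")
```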
### 📖Multi-GPUs/Multi-Nodes Parallelism ([©️back👆🏻](#paperlist))
<div id="DP-MP-PP-TP-SP-CP"></div>

|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2019.10|🔥🔥[**MP: ZeRO**] DeepSpeed-ZeRO: Memory Optimizations Toward Training Trillion Parameter Models(@microsoft.com)|[[pdf]](https://arxiv.org/pdf/1910.02054)|[[deepspeed]](https://github.com/microsoft/DeepSpeed) ![](https://img.shields.io/github/stars/microsoft/DeepSpeed.svg?style=social)|⭐️⭐️ |
|2020.05|🔥🔥[**TP: Megatron-LM**] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism(@NVIDIA)|[[pdf]](https://arxiv.org/pdf/1909.08053.pdf)|[[Megatron-LM]](https://github.com/NVIDIA/Megatron-LM) ![](https://img.shields.io/github/stars/NVIDIA/Megatron-LM.svg?style=social)|⭐️⭐️ |
|2022.05|🔥🔥[**SP: Megatron-LM**] Megatron-LM: Reducing Activation Recomputation in Large Transformer Models(@NVIDIA)|[[pdf]](https://arxiv.org/pdf/2205.05198)|[[Megatron-LM]](https://github.com/NVIDIA/Megatron-LM) ![](https://img.shields.io/github/stars/NVIDIA/Megatron-LM.svg?style=social)|⭐️⭐️ |
|2023.05|🔥🔥[**SP: BPT**] Blockwise Parallel Transformer for Large Context Models(@UC Berkeley)|[[pdf]](https://arxiv.org/pdf/2305.19370)|[[RingAttention]](https://github.com/lhao499/RingAttention) ![](https://img.shields.io/github/stars/lhao499/RingAttention.svg?style=social)|⭐️⭐️ |
|2023.10|🔥🔥[**SP: Ring Attention**] Ring Attention with Blockwise Transformers for Near-Infinite Context(@UC Berkeley)|[[pdf]](https://arxiv.org/pdf/2310.01889.pdf)|[[RingAttention]](https://github.com/lhao499/RingAttention) ![](https://img.shields.io/github/stars/lhao499/RingAttention.svg?style=social)|⭐️⭐️ |
|2023.11|🔥🔥[**SP: STRIPED ATTENTION**] STRIPED ATTENTION: FASTER RING ATTENTION FOR CAUSAL TRANSFORMERS(@MIT etc)|[[pdf]](https://arxiv.org/pdf/2311.09431.pdf)|[[striped_attention]](https://github.com/exists-forall/striped_attention/) ![](https://img.shields.io/github/stars/exists-forall/striped_attention.svg?style=social)|⭐️⭐️ |
|2023.10|🔥🔥[**SP: DEEPSPEED ULYSSES**] DEEPSPEED ULYSSES: SYSTEM OPTIMIZATIONS FOR ENABLING TRAINING OF EXTREME LONG SEQUENCE TRANSFORMER MODELS(@microsoft.com)|[[pdf]](https://arxiv.org/pdf/2309.14509)|[[deepspeed]](https://github.com/microsoft/DeepSpeed) ![](https://img.shields.io/github/stars/microsoft/DeepSpeed.svg?style=social)|⭐️⭐️ |
|2024.03|🔥🔥[**CP: Megatron-LM**] Megatron-LM: Context parallelism overview(@NVIDIA)|[[docs]](https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/context_parallel.html)|[[Megatron-LM]](https://github.com/NVIDIA/Megatron-LM) ![](https://img.shields.io/github/stars/NVIDIA/Megatron-LM.svg?style=social)|⭐️⭐️ |
|2024.05|🔥🔥[**SP: Unified Sequence Parallel (USP)**] YunChang: A Unified Sequence Parallel (USP) Attention for Long Context LLM Model Training and Inference(@Tencent)|[[pdf]]()|[[long-context-attention]](https://github.com/feifeibear/long-context-attention) ![](https://img.shields.io/github/stars/feifeibear/long-context-attention.svg?style=social)|⭐️⭐️ |
|2024.11|🔥🔥[**CP: Meta**] Context Parallelism for Scalable Million-Token Inference(@Meta Platforms, Inc)|[[pdf]](https://arxiv.org/pdf/2411.01783)|⚠️|⭐️⭐️ |
|2024.11|🔥🔥[**TP: Comm Compression**] Communication Compression for Tensor Parallel LLM Inference(@recogni.com)|[[pdf]](https://arxiv.org/pdf/2411.09510)|⚠️|⭐️⭐️ |
|2024.11|🔥🔥🔥[**SP: Star-Attention, 11x~ speedup**] Star Attention: Efficient LLM Inference over Long Sequences(@NVIDIA)|[[pdf]](https://arxiv.org/pdf/2411.17116)|[[Star-Attention]](https://github.com/NVIDIA/Star-Attention) ![](https://img.shields.io/github/stars/NVIDIA/Star-Attention.svg?style=social)|⭐️⭐️ |
|2024.12|🔥🔥[**SP: TokenRing**] TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication(@SJTU)|[[pdf]](https://arxiv.org/pdf/2412.20501)|[[token-ring]](https://github.com/ACA-Lab-SJTU/token-ring) ![](https://img.shields.io/github/stars/ACA-Lab-SJTU/token-ring.svg?style=social)|⭐️⭐️ |
|2025.05|🔥🔥[**FSDP 1/2**] PyTorch FSDP: Getting Started with Fully Sharded Data Parallel(FSDP) (@pytorch)|[[docs]](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html#getting-started-with-fully-sharded-data-parallel-fsdp)|⚠️|⭐️⭐️ |

### 📖Disaggregating Prefill and Decoding ([©️back👆🏻](#paperlist))
<div id="P-D-Disaggregating"></div>

|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2024.01|🔥🔥[**DistServe**] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving(@PKU)|[[pdf]](https://arxiv.org/pdf/2401.09670)|[[DistServe]](https://github.com/LLMServe/DistServe) ![](https://img.shields.io/github/stars/LLMServe/DistServe.svg?style=social)|⭐️⭐️ |
|2024.06|🔥🔥[**Mooncake**] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving(@Moonshot AI)|[[pdf]](https://github.com/kvcache-ai/Mooncake/blob/main/Mooncake-v1.pdf)|[[Mooncake]](https://github.com/kvcache-ai/Mooncake) ![](https://img.shields.io/github/stars/kvcache-ai/Mooncake.svg?style=social)|⭐️⭐️ |
|2024.12|🔥🔥[**KVDirect**] KVDirect: Distributed Disaggregated LLM Inference(@ByteDance)|[[pdf]](https://arxiv.org/pdf/2501.14743)|⚠️|⭐️ |
|2025.01|🔥🔥[**DeServe**] DESERVE: TOWARDS AFFORDABLE OFFLINE LLM INFERENCE VIA DECENTRALIZATION(@Berkeley)|[[pdf]](https://arxiv.org/pdf/2501.14784)|⚠️|⭐️ |
|2025.04|🔥🔥[**MegaScale-Infer**] MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism(@ByteDance Seed)|[[pdf]](https://arxiv.org/pdf/2504.02263)|⚠️|⭐️ |
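As background for the table above: disaggregated serving runs the compute-bound prefill phase and the memory-bandwidth-bound decode phase on separate worker pools and ships the KV cache between them, so each pool can be sized and batched independently. A minimal, scheduler-level sketch of that control flow (toy in-process queues, no real KV transfer, and not the design of any specific system listed here) might look like this:

```python
# Toy control flow for disaggregated prefill/decoding serving.
# Illustrative only: real systems (e.g. DistServe, Mooncake) move KV caches
# over fast interconnects and schedule across many GPUs.
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class Request:
    rid: int
    prompt: str
    kv_cache: list = field(default_factory=list)   # filled by the prefill pool
    generated: list = field(default_factory=list)

prefill_q: Queue[Request] = Queue()
decode_q: Queue[Request] = Queue()

def prefill_worker() -> None:
    # Compute-bound: one long pass over the whole prompt, then hand off.
    while not prefill_q.empty():
        req = prefill_q.get()
        req.kv_cache = [f"kv({tok})" for tok in req.prompt.split()]
        decode_q.put(req)            # "transfer" the KV cache to the decode pool

def decode_worker(max_new_tokens: int = 4) -> None:
    # Memory-bound: many small steps, each appending one token and one KV entry.
    while not decode_q.empty():
        req = decode_q.get()
        for step in range(max_new_tokens):
            req.generated.append(f"tok{step}")
            req.kv_cache.append(f"kv(tok{step})")
        print(req.rid, " ".join(req.generated))

for i, p in enumerate(["a short prompt", "another much longer prompt here"]):
    prefill_q.put(Request(i, p))
prefill_worker()
decode_worker()
```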
### 📖LLM Algorithmic/Eval Survey ([©️back👆🏻](#paperlist))
<div id="LLM-Algorithmic-Eval-Survey"></div>

|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2023.10|[Evaluating] Evaluating Large Language Models: A Comprehensive Survey(@tju.edu.cn)|[[pdf]](https://arxiv.org/pdf/2310.19736.pdf)|[[Awesome-LLMs-Evaluation]](https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers) ![](https://img.shields.io/github/stars/tjunlp-lab/Awesome-LLMs-Evaluation-Papers.svg?style=social)|⭐️ |
|2023.11|🔥[**Runtime Performance**] Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models(@hkust-gz.edu.cn)|[[pdf]](https://arxiv.org/pdf/2311.03687.pdf)|⚠️|⭐️⭐️ |
|2023.11|[ChatGPT Anniversary] ChatGPT's One-year Anniversary: Are Open-Source Large Language Models Catching up?(@e.ntu.edu.sg)|[[pdf]](https://arxiv.org/pdf/2311.16989.pdf)|⚠️|⭐️ |
|2023.12|[Algorithmic Survey] The Efficiency Spectrum of Large Language Models: An Algorithmic Survey(@Microsoft)|[[pdf]](https://arxiv.org/pdf/2312.00678.pdf)|⚠️|⭐️ |
|2023.12|[Security and Privacy] A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly(@Drexel University)|[[pdf]](https://arxiv.org/pdf/2312.02003.pdf)|⚠️|⭐️ |
|2023.12|🔥[**LLMCompass**] A Hardware Evaluation Framework for Large Language Model Inference(@princeton.edu)|[[pdf]](https://arxiv.org/pdf/2312.03134.pdf)|⚠️|⭐️⭐️ |
|2023.12|🔥[**Efficient LLMs**] Efficient Large Language Models: A Survey(@Ohio State University etc)|[[pdf]](https://arxiv.org/pdf/2312.03863.pdf)|[[Efficient-LLMs-Survey]](https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey) ![](https://img.shields.io/github/stars/AIoT-MLSys-Lab/Efficient-LLMs-Survey.svg?style=social)|⭐️⭐️ |
|2023.12|[**Serving Survey**] Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems(@Carnegie Mellon University)|[[pdf]](https://arxiv.org/pdf/2312.15234.pdf)|⚠️|⭐️⭐️ |
|2024.01|[Understanding LLMs] Understanding LLMs: A Comprehensive Overview from Training to Inference(@Shaanxi Normal University etc)|[[pdf]](https://arxiv.org/pdf/2401.02038.pdf)|⚠️|⭐️⭐️ |
|2024.02|[LLM-Viewer] LLM Inference Unveiled: Survey and Roofline Model Insights(@Zhihang Yuan etc)|[[pdf]](https://arxiv.org/pdf/2402.16363.pdf)|[[LLM-Viewer]](https://github.com/hahnyuan/LLM-Viewer) ![](https://img.shields.io/github/stars/hahnyuan/LLM-Viewer.svg?style=social)|⭐️⭐️ |
|2024.07|[**Internal Consistency & Self-Feedback**] Internal Consistency and Self-Feedback in Large Language Models: A Survey|[[pdf]](https://arxiv.org/pdf/2407.14507)|[[ICSF-Survey]](https://github.com/IAAR-Shanghai/ICSFSurvey) ![](https://img.shields.io/github/stars/IAAR-Shanghai/ICSFSurvey.svg?style=social)|⭐️⭐️ |
|2024.09|[**Low-bit**] A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms(@Beihang etc)|[[pdf]](https://arxiv.org/pdf/2409.16694)|⚠️|⭐️⭐️ |
|2024.10|[**LLM Inference**] LARGE LANGUAGE MODEL INFERENCE ACCELERATION: A COMPREHENSIVE HARDWARE PERSPECTIVE(@SJTU etc)|[[pdf]](https://arxiv.org/pdf/2410.04466)|⚠️|⭐️⭐️ |
### 📖LLM Train/Inference Framework/Design ([©️back👆🏻](#paperlist))
<div id="LLM-Train-Inference-Framework"></div>

|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2020.05|🔥[**Megatron-LM**] Training Multi-Billion Parameter Language Models Using Model Parallelism(@NVIDIA)|[[pdf]](https://arxiv.org/pdf/1909.08053.pdf)|[[Megatron-LM]](https://github.com/NVIDIA/Megatron-LM) ![](https://img.shields.io/github/stars/NVIDIA/Megatron-LM.svg?style=social)|⭐️⭐️ |
|2023.03|[FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU(@Stanford University etc)|[[pdf]](https://arxiv.org/pdf/2303.06865.pdf)|[[FlexGen]](https://github.com/FMInference/FlexGen) ![](https://img.shields.io/github/stars/FMInference/FlexGen.svg?style=social)|⭐️ |
|2023.05|[SpecInfer] Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification(@Peking University etc)|[[pdf]](https://arxiv.org/pdf/2305.09781.pdf)|[[FlexFlow]](https://github.com/flexflow/FlexFlow/tree/inference) ![](https://img.shields.io/github/stars/flexflow/FlexFlow.svg?style=social)|⭐️ |
|2023.05|[FastServe] Fast Distributed Inference Serving for Large Language Models(@Peking University etc)|[[pdf]](https://arxiv.org/pdf/2305.05920.pdf)|⚠️|⭐️ |
|2023.09|🔥[**vLLM**] Efficient Memory Management for Large Language Model Serving with PagedAttention(@UC Berkeley etc)|[[pdf]](https://arxiv.org/pdf/2309.06180.pdf)|[[vllm]](https://github.com/vllm-project/vllm) ![](https://img.shields.io/github/stars/vllm-project/vllm.svg?style=social)|⭐️⭐️ |
|2023.09|[StreamingLLM] EFFICIENT STREAMING LANGUAGE MODELS WITH ATTENTION SINKS(@Meta AI etc)|[[pdf]](https://arxiv.org/pdf/2309.17453.pdf)|[[streaming-llm]](https://github.com/mit-han-lab/streaming-llm) ![](https://img.shields.io/github/stars/mit-han-lab/streaming-llm.svg?style=social)|⭐️ |
|2023.09|[Medusa] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads(@Tianle Cai etc)|[[blog]](https://sites.google.com/view/medusa-llm)|[[Medusa]](https://github.com/FasterDecoding/Medusa) ![](https://img.shields.io/github/stars/FasterDecoding/Medusa.svg?style=social)|⭐️ |
|2023.10|🔥[**TensorRT-LLM**] NVIDIA TensorRT LLM(@NVIDIA)|[[docs]](https://nvidia.github.io/TensorRT-LLM/)|[[TensorRT-LLM]](https://github.com/NVIDIA/TensorRT-LLM) ![](https://img.shields.io/github/stars/NVIDIA/TensorRT-LLM.svg?style=social)|⭐️⭐️ |
|2023.11|🔥[**DeepSpeed-FastGen 2x vLLM?**] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference(@Microsoft)|[[pdf]](https://arxiv.org/pdf/2401.08671.pdf)|[[deepspeed-fastgen]](https://github.com/microsoft/DeepSpeed) ![](https://img.shields.io/github/stars/microsoft/DeepSpeed.svg?style=social)|⭐️⭐️ |
|2023.12|🔥🔥[**SGLang**] Efficiently Programming Large Language Models using SGLang(@Stanford University etc)|[[pdf]](https://arxiv.org/pdf/2312.07104)|[[sglang]](https://github.com/sgl-project/sglang) ![](https://img.shields.io/github/stars/sgl-project/sglang.svg?style=social)|⭐️⭐️ |
|2023.12|🔥[**PETALS**] Distributed Inference and Fine-tuning of Large Language Models Over The Internet(@HSE University etc)|[[pdf]](https://arxiv.org/pdf/2312.08361.pdf)|[[petals]](https://github.com/bigscience-workshop/petals) ![](https://img.shields.io/github/stars/bigscience-workshop/petals.svg?style=social)|⭐️⭐️ |
|2023.10|[LightSeq] LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers(@UC Berkeley etc)|[[pdf]](https://arxiv.org/pdf/2310.03294.pdf)|[[LightSeq]](https://github.com/RulinShao/LightSeq) ![](https://img.shields.io/github/stars/RulinShao/LightSeq.svg?style=social)|⭐️ |
|2023.12|[PowerInfer] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU(@SJTU)|[[pdf]](https://ipads.se.sjtu.edu.cn/_media/publications/powerinfer-20231219.pdf)|[[PowerInfer]](https://github.com/SJTU-IPADS/PowerInfer) ![](https://img.shields.io/github/stars/SJTU-IPADS/PowerInfer.svg?style=social)|⭐️ |
|2024.01|[inferflow] INFERFLOW: AN EFFICIENT AND HIGHLY CONFIGURABLE INFERENCE ENGINE FOR LARGE LANGUAGE MODELS(@Tencent AI Lab)|[[pdf]](https://arxiv.org/pdf/2401.08294.pdf)|[[inferflow]](https://github.com/inferflow/inferflow) ![](https://img.shields.io/github/stars/inferflow/inferflow.svg?style=social)|⭐️ |
|2024.06|🔥[**Mooncake**] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving(@Moonshot AI)|[[pdf]](https://github.com/kvcache-ai/Mooncake/blob/main/Mooncake-v1.pdf)|[[Mooncake]](https://github.com/kvcache-ai/Mooncake) ![](https://img.shields.io/github/stars/kvcache-ai/Mooncake.svg?style=social)|⭐️⭐️ |
|2023.06|🔥[**LMDeploy**] LMDeploy: LMDeploy is a toolkit for compressing, deploying, and serving LLMs(@InternLM)|[[docs]](https://lmdeploy.readthedocs.io/en/latest/)|[[lmdeploy]](https://github.com/InternLM/lmdeploy) ![](https://img.shields.io/github/stars/InternLM/lmdeploy.svg?style=social)|⭐️⭐️ |
|2023.05|🔥[**MLC-LLM**] Universal LLM Deployment Engine with ML Compilation(@mlc-ai)|[[docs]](https://llm.mlc.ai/)|[[mlc-llm]](https://github.com/mlc-ai/mlc-llm) ![](https://img.shields.io/github/stars/mlc-ai/mlc-llm.svg?style=social)|⭐️⭐️ |
|2023.08|🔥[**LightLLM**] LightLLM is a Python-based LLM (Large Language Model) inference and serving framework(@ModelTC)|[[docs]](https://github.com/ModelTC/lightllm)|[[lightllm]](https://github.com/ModelTC/lightllm) ![](https://img.shields.io/github/stars/ModelTC/lightllm.svg?style=social)|⭐️⭐️ |
|2023.03|🔥[**llama.cpp**] llama.cpp: Inference of Meta's LLaMA model (and others) in pure C/C++(@ggerganov)|[[docs]](https://github.com/ggerganov/llama.cpp)|[[llama.cpp]](https://github.com/ggerganov/llama.cpp) ![](https://img.shields.io/github/stars/ggerganov/llama.cpp.svg?style=social)|⭐️⭐️ |
|2024.02|🔥[**flashinfer**] FlashInfer: Kernel Library for LLM Serving(@flashinfer-ai)|[[docs]](https://flashinfer.ai/2024/02/02/cascade-inference.html)|[[flashinfer]](https://github.com/flashinfer-ai/flashinfer) ![](https://img.shields.io/github/stars/flashinfer-ai/flashinfer.svg?style=social)|⭐️⭐️ |
|2024.07|🔥[DynamoLLM] DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency(@Microsoft Azure Research)|[[pdf]](https://arxiv.org/pdf/2408.00741)|⚠️|⭐️ |
|2024.08|🔥[NanoFlow] NanoFlow: Towards Optimal Large Language Model Serving Throughput(@University of Washington)|[[pdf]](https://arxiv.org/pdf/2408.12757)|[[Nanoflow]](https://github.com/efeslab/Nanoflow) ![](https://img.shields.io/github/stars/efeslab/Nanoflow.svg?style=social)|⭐️⭐️ |
|2024.08|🔥[**Decentralized LLM**] Decentralized LLM Inference over Edge Networks with Energy Harvesting(@Padova)|[[pdf]](https://arxiv.org/pdf/2408.15907)|⚠️|⭐️ |
|2024.11|🔥[**SparseInfer**] SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference(@University of Seoul, etc)|[[pdf]](https://arxiv.org/pdf/2411.12692)|⚠️|⭐️ |
|2025.04|🔥[prima.cpp] PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters(@MBZUAI, etc)|[[pdf]](https://arxiv.org/pdf/2504.08791)|[[prima.cpp]](https://github.com/Lizonghang/prima.cpp) ![](https://img.shields.io/github/stars/Lizonghang/prima.cpp.svg?style=social)|⭐️|
|2025.07|🔥[**siiRL**] DistFlow: A Fully Distributed RL Framework for Scalable and Efficient LLM Post-Training(@Shanghai Innovation Institute)|[[pdf]](https://arxiv.org/pdf/2507.13833)|[[siiRL]](https://github.com/sii-research/siiRL)<br> ![](https://img.shields.io/github/stars/sii-research/siiRL.svg?style=social)|⭐️⭐️ |
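Several frameworks above, vLLM in particular, are organized around paged KV caches: instead of one contiguous KV buffer per sequence, the cache is cut into fixed-size blocks and each sequence keeps a block table, which removes fragmentation and lets blocks be freed or reused independently. A toy allocator sketch of that bookkeeping (not vLLM's API, names are invented for illustration) is shown below:

```python
# Toy paged KV-cache allocator in the spirit of PagedAttention (not vLLM's API).
BLOCK_SIZE = 16          # tokens stored per KV block

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}   # seq_id -> physical block ids

    def append_token(self, seq_id: int, pos: int) -> int:
        """Return the physical block holding the KV entry for token `pos` of `seq_id`."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:                      # block is full: grab a fresh one
            table.append(self.free_blocks.pop())
        return table[pos // BLOCK_SIZE]

    def free(self, seq_id: int) -> None:
        # Finished sequences return their blocks to the pool immediately.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=8)
for pos in range(40):                                  # a 40-token sequence uses 3 blocks
    cache.append_token(seq_id=0, pos=pos)
print(cache.block_tables[0], "free blocks:", len(cache.free_blocks))
cache.free(0)
```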
### 📖Continuous/In-flight Batching ([©️back👆🏻](#paperlist))
<div id="Continuous-In-flight-Batching"></div>

|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2022.07|🔥[**Continuous Batching**] Orca: A Distributed Serving System for Transformer-Based Generative Models(@Seoul National University etc)|[[pdf]](https://www.usenix.org/system/files/osdi22-yu.pdf)|⚠️|⭐️⭐️ |
|2023.10|🔥[**In-flight Batching**] NVIDIA TensorRT LLM Batch Manager(@NVIDIA)|[[docs]](https://nvidia.github.io/TensorRT-LLM/batch_manager.html)|[[TensorRT-LLM]](https://github.com/NVIDIA/TensorRT-LLM) ![](https://img.shields.io/github/stars/NVIDIA/TensorRT-LLM.svg?style=social)|⭐️⭐️ |
|2023.11|🔥[**DeepSpeed-FastGen 2x vLLM?**] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference(@Microsoft)|[[blog]](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen)|[[deepspeed-fastgen]](https://github.com/microsoft/DeepSpeed) ![](https://img.shields.io/github/stars/microsoft/DeepSpeed.svg?style=social)|⭐️⭐️ |
|2023.11|[Splitwise] Splitwise: Efficient Generative LLM Inference Using Phase Splitting(@Microsoft etc)|[[pdf]](https://arxiv.org/pdf/2311.18677.pdf)|⚠️|⭐️ |
|2023.12|[SpotServe] SpotServe: Serving Generative Large Language Models on Preemptible Instances(@cmu.edu etc)|[[pdf]](https://arxiv.org/pdf/2311.15566.pdf)|[[SpotServe]](https://github.com/Hsword/SpotServe) ![](https://img.shields.io/github/stars/Hsword/SpotServe.svg?style=social)|⭐️ |
|2023.10|[LightSeq] LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers(@UC Berkeley etc)|[[pdf]](https://arxiv.org/pdf/2310.03294.pdf)|[[LightSeq]](https://github.com/RulinShao/LightSeq) ![](https://img.shields.io/github/stars/RulinShao/LightSeq.svg?style=social)|⭐️ |
|2024.05|🔥[vAttention] vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention(@Microsoft Research India)|[[pdf]](https://arxiv.org/pdf/2405.04437)|[[vAttention]](https://github.com/microsoft/vattention) ![](https://img.shields.io/github/stars/microsoft/vattention.svg?style=social)|⭐️⭐️ |
|2024.07|🔥🔥[**vTensor**] vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving(@Shanghai Jiao Tong University etc)|[[pdf]](https://arxiv.org/pdf/2407.15309)|[[vTensor]](https://github.com/intelligent-machine-learning/glake/tree/master/GLakeServe) ![](https://img.shields.io/github/stars/intelligent-machine-learning/glake.svg?style=social)|⭐️⭐️ |
|2024.08|🔥[Automatic Inference Engine Tuning] Towards SLO-Optimized LLM Serving via Automatic Inference Engine Tuning(@Nanjing University etc)|[[pdf]](https://arxiv.org/pdf/2408.04323)|⚠️|⭐️⭐️ |
|2024.08|🔥[**SJF Scheduling**] Efficient LLM Scheduling by Learning to Rank(@UCSD etc)|[[pdf]](https://arxiv.org/pdf/2408.15792)|⚠️|⭐️⭐️ |
|2024.12|🔥[**BatchLLM**] BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching(@Microsoft)|[[pdf]](https://arxiv.org/pdf/2412.03594)|⚠️|⭐️⭐️ |
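The common thread in this section is iteration-level (continuous / in-flight) batching: rather than waiting for a whole batch of requests to finish, the scheduler re-forms the batch at every decode step, admitting new requests into free slots and retiring finished ones immediately. A toy scheduler loop in that spirit (illustrative only, not any specific system's scheduler) looks like this:

```python
# Toy continuous-batching loop: the batch is rebuilt at every decode iteration.
import random
from collections import deque

random.seed(0)
waiting = deque((rid, random.randint(2, 6)) for rid in range(6))  # (request id, tokens to generate)
running: dict[int, int] = {}       # request id -> tokens still to generate
MAX_BATCH = 3
steps = 0

while waiting or running:
    # Admit new requests into any free batch slots (iteration-level scheduling).
    while waiting and len(running) < MAX_BATCH:
        rid, remaining = waiting.popleft()
        running[rid] = remaining
    # One fused decode step produces one token for every running request.
    for rid in list(running):
        running[rid] -= 1
        if running[rid] == 0:       # finished requests leave right away,
            del running[rid]        # freeing their slot for the next iteration
    steps += 1

print(f"served 6 requests in {steps} decode steps with batch size {MAX_BATCH}")
```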
### 📖Weight/Activation Quantize/Compress ([©️back👆🏻](#paperlist))
<div id="Weight-Activation-Quantize-Compress"></div>

|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2022.06|🔥[**ZeroQuant**] Efficient and Affordable Post-Training Quantization for Large-Scale Transformers(@Microsoft)|[[pdf]](https://arxiv.org/pdf/2206.01861.pdf)|[[DeepSpeed]](https://github.com/microsoft/DeepSpeed) ![](https://img.shields.io/github/stars/microsoft/DeepSpeed.svg?style=social)|⭐️⭐️ |
|2022.08|[FP8-Quantization] FP8 Quantization: The Power of the Exponent(@Qualcomm AI Research)|[[pdf]](https://arxiv.org/pdf/2208.09225.pdf)|[[FP8-quantization]](https://github.com/Qualcomm-AI-research/FP8-quantization) ![](https://img.shields.io/github/stars/Qualcomm-AI-research/FP8-quantization.svg?style=social)|⭐️ |
|2022.08|[LLM.int8()] 8-bit Matrix Multiplication for Transformers at Scale(@Facebook AI Research etc)|[[pdf]](https://arxiv.org/pdf/2208.07339.pdf)|[[bitsandbytes]](https://github.com/timdettmers/bitsandbytes) ![](https://img.shields.io/github/stars/timdettmers/bitsandbytes.svg?style=social)|⭐️ |
|2022.10|🔥[**GPTQ**] GPTQ: ACCURATE POST-TRAINING QUANTIZATION FOR GENERATIVE PRE-TRAINED TRANSFORMERS(@IST Austria etc)|[[pdf]](https://arxiv.org/pdf/2210.17323.pdf)|[[gptq]](https://github.com/IST-DASLab/gptq) ![](https://img.shields.io/github/stars/IST-DASLab/gptq.svg?style=social)|⭐️⭐️ |
|2022.11|🔥[**WINT8/4**] Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production(@NVIDIA&Microsoft)|[[pdf]](https://arxiv.org/pdf/2211.10017.pdf)|[[FasterTransformer]](https://github.com/NVIDIA/FasterTransformer) ![](https://img.shields.io/github/stars/NVIDIA/FasterTransformer.svg?style=social)|⭐️⭐️ |
|2022.11|🔥[**SmoothQuant**] Accurate and Efficient Post-Training Quantization for Large Language Models(@MIT etc)|[[pdf]](https://arxiv.org/pdf/2211.10438.pdf)|[[smoothquant]](https://github.com/mit-han-lab/smoothquant) ![](https://img.shields.io/github/stars/mit-han-lab/smoothquant.svg?style=social)|⭐️⭐️ |
|2023.03|[ZeroQuant-V2] Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation(@Microsoft)|[[pdf]](https://arxiv.org/pdf/2303.08302.pdf)|[[DeepSpeed]](https://github.com/microsoft/DeepSpeed) ![](https://img.shields.io/github/stars/microsoft/DeepSpeed.svg?style=social)|⭐️ |
|2023.06|🔥[**AWQ**] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration(@MIT etc)|[[pdf]](https://browse.arxiv.org/pdf/2306.00978.pdf)|[[llm-awq]](https://github.com/mit-han-lab/llm-awq) ![](https://img.shields.io/github/stars/mit-han-lab/llm-awq.svg?style=social)|⭐️⭐️ |
|2023.06|[SpQR] SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression(@University of Washington etc)|[[pdf]](https://browse.arxiv.org/pdf/2306.03078.pdf)|[[SpQR]](https://github.com/Vahe1994/SpQR) ![](https://img.shields.io/github/stars/Vahe1994/SpQR.svg?style=social)|⭐️ |
|2023.06|[SqueezeLLM] SQUEEZELLM: DENSE-AND-SPARSE QUANTIZATION(@berkeley.edu)|[[pdf]](https://arxiv.org/pdf/2306.07629.pdf)|[[SqueezeLLM]](https://github.com/SqueezeAILab/SqueezeLLM) ![](https://img.shields.io/github/stars/SqueezeAILab/SqueezeLLM.svg?style=social)|⭐️ |
|2023.07|[ZeroQuant-FP] A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats(@Microsoft)|[[pdf]](https://arxiv.org/pdf/2307.09782.pdf)|[[DeepSpeed]](https://github.com/microsoft/DeepSpeed) ![](https://img.shields.io/github/stars/microsoft/DeepSpeed.svg?style=social)|⭐️ |
|2023.09|[KV Cache FP8 + WINT4] Exploration on LLM inference performance optimization(@HPC4AI)|[[blog]](https://zhuanlan.zhihu.com/p/653735572)|⚠️|⭐️ |
|2023.10|[FP8-LM] FP8-LM: Training FP8 Large Language Models(@Microsoft etc)|[[pdf]](https://arxiv.org/pdf/2310.18313.pdf)|[[MS-AMP]](https://github.com/Azure/MS-AMP) ![](https://img.shields.io/github/stars/Azure/MS-AMP.svg?style=social)|⭐️ |
|2023.10|[LLM-Shearing] SHEARED LLAMA: ACCELERATING LANGUAGE MODEL PRE-TRAINING VIA STRUCTURED PRUNING(@cs.princeton.edu etc)|[[pdf]](https://arxiv.org/pdf/2310.06694.pdf)|[[LLM-Shearing]](https://github.com/princeton-nlp/LLM-Shearing) ![](https://img.shields.io/github/stars/princeton-nlp/LLM-Shearing.svg?style=social)|⭐️ |
|2023.10|[LLM-FP4] LLM-FP4: 4-Bit Floating-Point Quantized Transformers(@ust.hk&meta etc)|[[pdf]](https://arxiv.org/pdf/2310.16836.pdf)|[[LLM-FP4]](https://github.com/nbasyl/LLM-FP4) ![](https://img.shields.io/github/stars/nbasyl/LLM-FP4.svg?style=social)|⭐️ |
|2023.11|[2-bit LLM] Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization(@Shanghai Jiao Tong University etc)|[[pdf]](https://arxiv.org/pdf/2311.16442.pdf)|⚠️|⭐️ |
|2023.12|[**SmoothQuant+**] SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM(@ZTE Corporation)|[[pdf]](https://arxiv.org/pdf/2312.03788.pdf)|[[smoothquantplus]](https://github.com/Adlik/smoothquantplus) ![](https://img.shields.io/github/stars/Adlik/smoothquantplus.svg?style=social)|⭐️ |
|2023.11|[OdysseyLLM W4A8] A Speed Odyssey for Deployable Quantization of LLMs(@meituan.com)|[[pdf]](https://arxiv.org/pdf/2311.09550.pdf)|⚠️|⭐️ |
|2023.12|🔥[**SparQ**] SPARQ ATTENTION: BANDWIDTH-EFFICIENT LLM INFERENCE(@graphcore.ai)|[[pdf]](https://arxiv.org/pdf/2312.04985.pdf)|⚠️|⭐️⭐️ |
|2023.12|[Agile-Quant] Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge(@Northeastern University&Oracle)|[[pdf]](https://arxiv.org/pdf/2312.05693.pdf)|⚠️|⭐️ |
|2023.12|[CBQ] CBQ: Cross-Block Quantization for Large Language Models(@ustc.edu.cn)|[[pdf]](https://arxiv.org/pdf/2312.07950.pdf)|⚠️|⭐️ |
|2023.10|[QLLM] QLLM: ACCURATE AND EFFICIENT LOW-BITWIDTH QUANTIZATION FOR LARGE LANGUAGE MODELS(@ZIP Lab&SenseTime Research etc)|[[pdf]](https://arxiv.org/pdf/2310.08041.pdf)|⚠️|⭐️ |
|2024.01|[FP6-LLM] FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design(@Microsoft etc)|[[pdf]](https://arxiv.org/pdf/2401.14112.pdf)|⚠️|⭐️ |
|2024.05|🔥🔥[**W4A8KV4**] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving(@MIT&NVIDIA)|[[pdf]](https://arxiv.org/pdf/2405.04532)|[[qserve]](https://github.com/mit-han-lab/qserve) ![](https://img.shields.io/github/stars/mit-han-lab/qserve.svg?style=social)|⭐️⭐️ |
|2024.05|🔥[SpinQuant] SpinQuant: LLM Quantization with Learned Rotations(@Meta)|[[pdf]](https://arxiv.org/pdf/2405.16406)|⚠️|⭐️ |
|2024.05|🔥[I-LLM] I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models(@Houmo AI)|[[pdf]](https://arxiv.org/pdf/2405.17849)|⚠️|⭐️ |
|2024.06|🔥[OutlierTune] OutlierTune: Efficient Channel-Wise Quantization for Large Language Models(@Beijing University)|[[pdf]](https://arxiv.org/pdf/2406.18832)|⚠️|⭐️ |
|2024.06|🔥[GPTQT] GPTQT: Quantize Large Language Models Twice to Push the Efficiency(@zju)|[[pdf]](https://arxiv.org/pdf/2407.02891)|⚠️|⭐️ |
|2024.08|🔥[ABQ-LLM] ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models(@ByteDance)|[[pdf]](https://arxiv.org/pdf/2408.08554)|[[ABQ-LLM]](https://github.com/bytedance/ABQ-LLM) ![](https://img.shields.io/github/stars/bytedance/ABQ-LLM.svg?style=social)|⭐️ |
|2024.08|🔥[1-bit LLMs] Matmul or No Matmal in the Era of 1-bit LLMs(@University of South Carolina)|[[pdf]](https://arxiv.org/pdf/2408.11939)|⚠️|⭐️ |
|2024.08|🔥[ACTIVATION SPARSITY] TRAINING-FREE ACTIVATION SPARSITY IN LARGE LANGUAGE MODELS(@MIT etc)|[[pdf]](https://arxiv.org/pdf/2408.14690)|[[TEAL]](https://github.com/FasterDecoding/TEAL) ![](https://img.shields.io/github/stars/FasterDecoding/TEAL.svg?style=social)|⭐️ |
|2024.09|🔥[VPTQ] VPTQ: EXTREME LOW-BIT VECTOR POST-TRAINING QUANTIZATION FOR LARGE LANGUAGE MODELS(@Microsoft)|[[pdf]](https://arxiv.org/pdf/2409.17066)|[[VPTQ]](https://github.com/microsoft/VPTQ) ![](https://img.shields.io/github/stars/microsoft/VPTQ.svg?style=social)|⭐️ |
|2024.11|🔥[BitNet] BitNet a4.8: 4-bit Activations for 1-bit LLMs(@Microsoft)|[[pdf]](https://arxiv.org/pdf/2411.04965)|[[bitnet]](https://github.com/microsoft/unilm/tree/master/bitnet) ![](https://img.shields.io/github/stars/microsoft/unilm.svg?style=social)|⭐️ |
|2025.04|🔥[**BitNet v2**] BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs(@Microsoft)|[[pdf]](https://arxiv.org/pdf/2504.18415)|[[bitnet]](https://github.com/microsoft/unilm/tree/master/bitnet) ![](https://img.shields.io/github/stars/microsoft/unilm.svg?style=social)|⭐️ |
|2025.05|🔥[**GuidedQuant**] GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance (@SNU&SamsungAILab&Google)|[[pdf]](https://arxiv.org/pdf/2505.07004)|[[GuidedQuant]](https://github.com/snu-mllab/GuidedQuant) ![](https://img.shields.io/github/stars/snu-mllab/GuidedQuant.svg?style=social)|⭐️⭐️ |
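As a concrete anchor for the WINT8/4-style entries above, the simplest scheme the more advanced methods refine is per-channel absmax weight quantization: each output channel is scaled so that its largest absolute weight maps to 127, the weights are stored as int8, and the scale is multiplied back in after the matmul. A minimal NumPy sketch, purely for illustration:

```python
# Minimal per-channel absmax INT8 weight quantization (the baseline that
# GPTQ/AWQ/SmoothQuant-style methods improve on); illustrative only.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 1024)).astype(np.float32)   # [out_features, in_features]
x = rng.standard_normal((1024,)).astype(np.float32)

# One scale per output channel: the absmax of that row maps to 127.
scale = np.abs(W).max(axis=1, keepdims=True) / 127.0        # [out_features, 1]
W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)

# Inference: int8 weights, wider accumulation, rescale the result afterwards.
y_q = (W_q.astype(np.int32) @ x) * scale.squeeze()
y_ref = W @ x

rel_err = np.abs(y_q - y_ref).max() / np.abs(y_ref).max()
print(f"weight memory: {W_q.nbytes / W.nbytes:.2f}x of fp32, max relative error ~{rel_err:.4f}")
```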
### 📖IO/FLOPs-Aware/Sparse Attention ([©️back👆🏻](#paperlist))
<div id="IO-FLOPs-Aware-Attention-Sparse"></div>

|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2018.05|[Online Softmax] Online normalizer calculation for softmax(@NVIDIA)|[[pdf]](https://arxiv.org/pdf/1805.02867.pdf)|⚠️|⭐️ |
|2019.11|🔥[MQA] Fast Transformer Decoding: One Write-Head is All You Need(@Google)|[[pdf]](https://arxiv.org/pdf/1911.02150.pdf)|⚠️|⭐️⭐️ |
|2020.10|[Hash Attention] REFORMER: THE EFFICIENT TRANSFORMER(@Google)|[[pdf]](https://arxiv.org/pdf/2001.04451.pdf)|[[reformer]](https://github.com/google/trax/tree/master/trax/models/reformer) ![](https://img.shields.io/github/stars/google/trax.svg?style=social)|⭐️⭐️ |
|2022.05|🔥[**FlashAttention**] Fast and Memory-Efficient Exact Attention with IO-Awareness(@Stanford University etc)|[[pdf]](https://arxiv.org/pdf/2205.14135.pdf)|[[flash-attention]](https://github.com/Dao-AILab/flash-attention) ![](https://img.shields.io/github/stars/Dao-AILab/flash-attention.svg?style=social)|⭐️⭐️ |
|2022.10|[Online Softmax] SELF-ATTENTION DOES NOT NEED O(n^2) MEMORY(@Google)|[[pdf]](https://arxiv.org/pdf/2112.05682.pdf)|⚠️|⭐️ |
|2023.05|[FlashAttention] From Online Softmax to FlashAttention(@cs.washington.edu)|[[pdf]](https://courses.cs.washington.edu/courses/cse599m/23sp/notes/flashattn.pdf)|⚠️|⭐️⭐️ |
|2023.05|[FLOP, I/O] Dissecting Batching Effects in GPT Inference(@Lequn Chen)|[[blog]](https://le.qun.ch/en/blog/2023/05/13/transformer-batching/)|⚠️|⭐️ |
|2023.05|🔥🔥[**GQA**] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints(@Google)|[[pdf]](https://arxiv.org/pdf/2305.13245.pdf)|[[flaxformer]](https://github.com/google/flaxformer) ![](https://img.shields.io/github/stars/google/flaxformer.svg?style=social)|⭐️⭐️ |
|2023.06|[Sparse FlashAttention] Faster Causal Attention Over Large Sequences Through Sparse Flash Attention(@EPFL etc)|[[pdf]](https://arxiv.org/pdf/2306.01160.pdf)|[[dynamic-sparse-flash-attention]](https://github.com/epfml/dynamic-sparse-flash-attention) ![](https://img.shields.io/github/stars/epfml/dynamic-sparse-flash-attention.svg?style=social)|⭐️ |
|2023.07|🔥[**FlashAttention-2**] Faster Attention with Better Parallelism and Work Partitioning(@Stanford University etc)|[[pdf]](https://arxiv.org/pdf/2307.08691.pdf)|[[flash-attention]](https://github.com/Dao-AILab/flash-attention) ![](https://img.shields.io/github/stars/Dao-AILab/flash-attention.svg?style=social)|⭐️⭐️ |
|2023.10|🔥[**Flash-Decoding**] Flash-Decoding for long-context inference(@Stanford University etc)|[[blog]](https://crfm.stanford.edu/2023/10/12/flashdecoding.html)|[[flash-attention]](https://github.com/Dao-AILab/flash-attention) ![](https://img.shields.io/github/stars/Dao-AILab/flash-attention.svg?style=social)|⭐️⭐️ |
|2023.11|[Flash-Decoding++] FLASHDECODING++: FASTER LARGE LANGUAGE MODEL INFERENCE ON GPUS(@Tsinghua University&Infinigence-AI)|[[pdf]](https://arxiv.org/pdf/2311.01282.pdf)|⚠️|⭐️ |
|2023.01|[SparseGPT] SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot(@ISTA etc)|[[pdf]](https://arxiv.org/pdf/2301.00774.pdf)|[[sparsegpt]](https://github.com/IST-DASLab/sparsegpt) ![](https://img.shields.io/github/stars/IST-DASLab/sparsegpt.svg?style=social)|⭐️ |
|2023.12|🔥[**GLA**] Gated Linear Attention Transformers with Hardware-Efficient Training(@MIT-IBM Watson AI)|[[pdf]](https://arxiv.org/pdf/2312.06635.pdf)|[gated_linear_attention](https://github.com/berlino/gated_linear_attention) ![](https://img.shields.io/github/stars/berlino/gated_linear_attention.svg?style=social)|⭐️⭐️ |
|2023.12|[SCCA] SCCA: Shifted Cross Chunk Attention for long contextual semantic expansion(@Beihang University)|[[pdf]](https://arxiv.org/pdf/2312.07305.pdf)|⚠️|⭐️ |
|\n|2024.03|🔥🔥[CHAI] CHAI: Clustered Head Attention for Efficient LLM Inference(@cs.wisc.edu etc)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.08058.pdf) | ⚠️ |⭐️⭐️ |\n|2024.04|🔥🔥[DeFT] DeFT: Decoding with Flash Tree-Attention for Efficient Tree-structured LLM Inference(@Westlake University etc)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.00242) | ⚠️ |⭐️⭐️ |\n|2024.04|[MoA] MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression(@thu et al.)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.14909) | [[MoA]](https:\u002F\u002Fgithub.com\u002Fthu-nics\u002FMoA) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-nics\u002FMoA.svg?style=social) | ⭐️ |\n|2024.07|🔥🔥[**FlashAttention-3**] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision(@TriDao etc) |[[pdf]](https:\u002F\u002Ftridao.me\u002Fpublications\u002Fflash3\u002Fflash3.pdf)|[[flash-attention]](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDao-AILab\u002Fflash-attention.svg?style=social)|⭐️⭐️ |\n|2024.07|🔥🔥[**MInference 1.0**] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention(@Microsoft) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.02490)|[[MInference 1.0]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMInference) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FMInference.svg?style=social)|⭐️⭐️ |\n|2024.07|🔥🔥[Shared Attention] Beyond KV Caching: Shared Attention for Efficient LLMs(@Kyushu University etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.12866) | [[shareAtt]](https:\u002F\u002Fgithub.com\u002Fmetacarbon\u002FshareAtt) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmetacarbon\u002FshareAtt.svg?style=social) | ⭐️ |\n|2024.09|🔥🔥[**CHESS**] CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification(@Wuhan University)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.01366) | ⚠️ |⭐️⭐️ |\n|2024.09|🔥🔥[INT-FLASHATTENTION] INT-FLASHATTENTION: ENABLING FLASH ATTENTION FOR INT8 QUANTIZATION(@PKU etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.16997)| [[INT-FlashAttention]](https:\u002F\u002Fgithub.com\u002FINT-FlashAttention2024\u002FINT-FlashAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FINT-FlashAttention2024\u002FINT-FlashAttention.svg?style=social) | ⭐️ |\n|2024.10|🔥🔥[**SageAttention**] SAGEATTENTION: ACCURATE 8-BIT ATTENTION FOR PLUG-AND-PLAY INFERENCE ACCELERATION(@thu-ml)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.02367)|[[SageAttention]](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002FSageAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-ml\u002FSageAttention) | ⭐️⭐️ |\n|2024.11|🔥🔥[**SageAttention-2**] SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization(@thu-ml)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.10958)|[[SageAttention]](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002FSageAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-ml\u002FSageAttention) | ⭐️⭐️ |\n|2024.11|🔥🔥[**Squeezed Attention**] SQUEEZED ATTENTION: Accelerating Long Context Length LLM Inference(@UC Berkeley) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.09688)|[[SqueezedAttention]](https:\u002F\u002Fgithub.com\u002FSqueezeAILab\u002FSqueezedAttention) 
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSqueezeAILab\u002FSqueezedAttention) | ⭐️⭐️ |\n|2024.12|🔥🔥[**TurboAttention**] TURBOATTENTION: EFFICIENT ATTENTION APPROXIMATION FOR HIGH THROUGHPUTS LLMS(@Microsoft)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.08585)| ⚠️ |⭐️⭐️ |\n|2025.01|🔥🔥[**FFPA**] FFPA: Yet another Faster Flash Prefill Attention with O(1) SRAM complexity for headdim > 256, ~1.5x faster than SDPA EA(@xlite-dev)|[[docs]](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn)| [[ffpa-attn]](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fxlite-dev\u002Fffpa-attn)|⭐️⭐️ |\n|2025.03|🔥🔥[**SpargeAttention**] SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference(@thu-ml)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.18137)|[[SpargeAttn]](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002FSpargeAttn) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-ml\u002FSpargeAttn) | ⭐️⭐️ |\n|2025.04|🔥🔥[**MMInference**] MMInference: Accelerating Pre-filling for Long-Context Visual Language Models via Modality-Aware Permutation Sparse Attention(@microsoft) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.16083)|[[MInference]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMInference\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FMInference) | ⭐️⭐️ |\n|2025.04|🔥🔥[**Sparse Frontier**] The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs (@Cohere) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.17768)|[[SparseFrontier]](https:\u002F\u002Fgithub.com\u002FPiotrNawrot\u002Fsparse-frontier) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPiotrNawrot\u002Fsparse-frontier) | ⭐️⭐️ |\n|2024.12|🔥🔥[**Flex Attention**] FLEX ATTENTION: A PROGRAMMING MODEL FOR GENERATING OPTIMIZED ATTENTION KERNELS(@pytorch) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.05496)|[[attention-gym]](https:\u002F\u002Fgithub.com\u002Fpytorch-labs\u002Fattention-gym) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpytorch-labs\u002Fattention-gym) | ⭐️⭐️ |\n|2025.02| 🔥🔥🔥[**SeerAttention**] SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs(@microsoft) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13276) | [[SeerAttention]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FSeerAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FSeerAttention.svg?style=social) | ⭐️⭐️⭐️ |\n|2025.03| [**Slim attention**] Slim attention: cut your context memory in half without loss of accuracy, K-cache is all you need for MHA(@OpenMachine.ai) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.05840) | [[OpenMachine]](https:\u002F\u002Fgithub.com\u002FOpenMachine-ai\u002Ftransformer-tricks) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenMachine-ai\u002Ftransformer-tricks.svg?style=social) | ⭐️⭐️⭐️ |\n|2025.05|🔥🔥[**SageAttention-3**] SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-bit Training(@thu-ml)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.11594)|[[SageAttention]](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002FSageAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-ml\u002FSageAttention) | ⭐️⭐️ |\n|2025.04|🔥🔥[**Parallel Encoding**] APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel 
Encoding(@cmu.edu&NVIDIA)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.05431)|[[APE]](https:\u002F\u002Fgithub.com\u002FInfini-AI-Lab\u002FAPE) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInfini-AI-Lab\u002FAPE) | ⭐️⭐️ |\n|2025.04|🔥🔥[**Parallel Encoding**] Block-Attention for Efficient Prefilling(@Tencent etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.15355)|[[Block-attention]](https:\u002F\u002Fgithub.com\u002FTemporaryLoRA\u002FBlock-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTemporaryLoRA\u002FBlock-attention) | ⭐️⭐️ |\n\n### 📖KV Cache Scheduling\u002FQuantize\u002FDropping ([©️back👆🏻](#paperlist))\n\u003Cdiv id=\"KV-Cache-Scheduling-Quantize-Dropping\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n|2019.11|🔥[MQA] Fast Transformer Decoding: One Write-Head is All You Need(@Google) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1911.02150.pdf)|⚠️|⭐️⭐️ |\n|2022.06|[LTP] Learned Token Pruning for Transformers(@UC Berkeley etc)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.00910.pdf)|[[LTP]](https:\u002F\u002Fgithub.com\u002Fkssteven418\u002FLTP) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkssteven418\u002FLTP.svg?style=social)|⭐️ |\n|2023.05|🔥🔥[**GQA**] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints(@Google) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.13245.pdf)|[[flaxformer]](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fflaxformer) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgoogle\u002Fflaxformer.svg?style=social) |⭐️⭐️ |\n|2023.05|[KV Cache Compress] Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time(@)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.17118.pdf)|⚠️|⭐️⭐️ |\n|2023.06|[H2O] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models(@Rice University etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.14048.pdf)|[[H2O]](https:\u002F\u002Fgithub.com\u002FFMInference\u002FH2O) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFMInference\u002FH2O.svg?style=social) |⭐️ |\n|2023.06|[QK-Sparse\u002FDropping Attention] Faster Causal Attention Over Large Sequences Through Sparse Flash Attention(@EPFL etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.01160.pdf) | [[dynamic-sparse-flash-attention]](https:\u002F\u002Fgithub.com\u002Fepfml\u002Fdynamic-sparse-flash-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fepfml\u002Fdynamic-sparse-flash-attention.svg?style=social)|⭐️ |\n|2023.08|🔥🔥[Chunked Prefills] SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills(@Microsoft etc) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.16369.pdf)|⚠️|⭐️⭐️ |\n|2023.09|🔥🔥[**PagedAttention**] Efficient Memory Management for Large Language  Model Serving with PagedAttention(@UC Berkeley etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.06180.pdf)|[[vllm]](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvllm-project\u002Fvllm.svg?style=social)|⭐️⭐️ |\n|2023.09|[KV Cache FP8 + WINT4] Exploration on LLM inference performance optimization(@HPC4AI) | [[blog]](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F653735572)|⚠️|⭐️ |\n|2023.10|🔥[**TensorRT-LLM KV Cache FP8**] NVIDIA TensorRT LLM(@NVIDIA) 
|[[docs]](https:\u002F\u002Fnvidia.github.io\u002FTensorRT-LLM\u002Fprecision.html)|[[TensorRT-LLM]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-LLM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FTensorRT-LLM.svg?style=social) |⭐️⭐️ |\n|2023.10|🔥[**Adaptive KV Cache Compress**] MODEL TELLS YOU WHAT TO DISCARD: ADAPTIVE KV CACHE COMPRESSION FOR LLMS(@illinois.edu&microsoft)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.01801.pdf)|⚠️|⭐️⭐️ |\n|2023.10|[CacheGen] CacheGen: Fast Context Loading for Language Model Applications(@Chicago University&Microsoft)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.07240.pdf)|[[LMCache]](https:\u002F\u002Fgithub.com\u002FLMCache\u002FLMCache) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLMCache\u002FLMCache.svg?style=social)|⭐️ |\n|2023.12|[KV-Cache Optimizations] Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO(@Haim Barad etc) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.04951.pdf)|⚠️|⭐️ |\n|2023.12|[KV Cache Compress with LoRA] Compressed Context Memory for Online Language Model Interaction (@SNU & NAVER AI) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.03414.pdf)|[[Compressed-Context-Memory]](https:\u002F\u002Fgithub.com\u002Fsnu-mllab\u002FContext-Memory) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsnu-mllab\u002FContext-Memory.svg?style=social) |⭐️⭐️ |\n|2023.12|🔥🔥[**RadixAttention**] Efficiently Programming Large Language Models using SGLang(@Stanford University etc) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.07104)|[[sglang]](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsgl-project\u002Fsglang.svg?style=social) |⭐️⭐️ |\n|2024.01|🔥🔥[**DistKV-LLM**] Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache(@Alibaba etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.02669.pdf)|⚠️|⭐️⭐️ |\n|2024.02|🔥🔥[Prompt Caching] Efficient Prompt Caching via Embedding Similarity(@UC Berkeley)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.01173.pdf)|⚠️|⭐️⭐️ |\n|2024.02|🔥🔥[Less] Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference(@CMU etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.09398.pdf)|⚠️|⭐️ |\n|2024.02|🔥🔥[MiKV] No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization(@KAIST)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.18096.pdf)|⚠️|⭐️ |\n|2024.02|🔥🔥[**Shared Prefixes**] Hydragen: High-Throughput LLM Inference with Shared Prefixes | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.05099.pdf)|⚠️|⭐️⭐️ |\n|2024.02|🔥🔥[**ChunkAttention**] ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition(@microsoft.com)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.15220)|[[chunk-attention]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fchunk-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Fchunk-attention.svg?style=social) |⭐️⭐️ |\n|2024.03|🔥[QAQ] QAQ: Quality Adaptive Quantization for LLM KV Cache(@smail.nju.edu.cn)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.04643.pdf)|[[QAQ-KVCacheQuantization]](https:\u002F\u002Fgithub.com\u002FClubieDong\u002FQAQ-KVCacheQuantization) 
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FClubieDong\u002FQAQ-KVCacheQuantization.svg?style=social) |⭐️⭐️ |\n|2024.03|🔥🔥[DMC] Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference(@NVIDIA etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.09636.pdf)|⚠️|⭐️⭐️ |\n|2024.03|🔥🔥[Keyformer] Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference(@ece.ubc.ca etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.09054.pdf)|[[Keyformer]](https:\u002F\u002Fgithub.com\u002Fd-matrix-ai\u002Fkeyformer-llm) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fd-matrix-ai\u002Fkeyformer-llm.svg?style=social)|⭐️⭐️ |\n|2024.03|[FASTDECODE] FASTDECODE: High-Throughput GPU-Efficient LLM Serving using Heterogeneous(@Tsinghua University)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.11421.pdf)|⚠️|⭐️⭐️ |\n|2024.03|[Sparsity-Aware KV Caching] ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching(@ucf.edu)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.17312.pdf)|⚠️|⭐️⭐️ |\n|2024.03|🔥[GEAR] GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM(@gatech.edu)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.05527)|[[GEAR]](https:\u002F\u002Fgithub.com\u002Fopengear-project\u002FGEAR) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopengear-project\u002FGEAR.svg?style=social)|⭐️ |\n|2024.04|[SqueezeAttention] SQUEEZEATTENTION: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget(@lzu.edu.cn etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.04793.pdf)|[[SqueezeAttention]](https:\u002F\u002Fgithub.com\u002Fhetailang\u002FSqueezeAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhetailang\u002FSqueezeAttention.svg?style=social) |⭐️⭐️ |\n|2024.04|[SnapKV] SnapKV: LLM Knows What You are Looking for Before Generation(@UIUC)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.14469)|[[SnapKV]](https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FSnapKV) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFasterDecoding\u002FSnapKV.svg?style=social)|⭐️ |\n|2024.05|🔥[vAttention] vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention(@Microsoft Research India)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.04437)|[[vAttention]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fvattention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Fvattention.svg?style=social)|⭐️⭐️ |\n|2024.05|🔥[KVCache-1Bit] KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization(@Rice University)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.03917)|⚠️|⭐️⭐️ |\n|2024.05|🔥[KV-Runahead] KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation(@Apple etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.05329)|⚠️|⭐️⭐️ |\n|2024.05|🔥[ZipCache] ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification(@Zhejiang University etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.14256)|⚠️|⭐️⭐️ |\n|2024.05|🔥[MiniCache] MiniCache: KV Cache Compression in Depth Dimension for Large Language Models(@ZIP Lab)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.14366)|⚠️|⭐️⭐️ |\n|2024.05|🔥[CacheBlend] CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion(@University of 
Chicago)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.16444)|[[LMCache]](https:\u002F\u002Fgithub.com\u002FLMCache\u002FLMCache) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLMCache\u002FLMCache.svg?style=social)|⭐️⭐️ |\n|2024.06|🔥[CompressKV] Effectively Compress KV Heads for LLM(@alibaba etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.07056)|⚠️|⭐️⭐️ |\n|2024.06|🔥[MemServe] MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool(@Huawei Cloud etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.17565)|⚠️|⭐️⭐️ |\n|2024.07|🔥[MLKV] MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding(@Institut Teknologi Bandung)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.09297)|[[pythia-mlkv]](https:\u002F\u002Fgithub.com\u002Fzaydzuhri\u002Fpythia-mlkv) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzaydzuhri\u002Fpythia-mlkv.svg?style=social)|⭐️ |\n|2024.07|🔥[ThinK] ThinK: Thinner Key Cache by Query-Driven Pruning(@Salesforce AI Research etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.21018)|⚠️|⭐️⭐️ |\n|2024.07|🔥[Palu] Palu: Compressing KV-Cache with Low-Rank Projection(@nycu.edu.tw)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.21118)|[[Palu]](https:\u002F\u002Fgithub.com\u002Fshadowpa0327\u002FPalu) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshadowpa0327\u002FPalu.svg?style=social)|⭐️⭐️ |\n|2024.08|🔥[Zero-Delay QKV Compression] Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference(@University of Virginia)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.04107)|⚠️|⭐️⭐️ |\n|2024.09|🔥[**AlignedKV**] AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization(@Tsinghua University)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.16546)|[[AlignedKV]](https:\u002F\u002Fgithub.com\u002FAlignedQuant\u002FAlignedKV) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlignedQuant\u002FAlignedKV.svg?style=social)|⭐️ |\n|2024.10|🔥[**LayerKV**] Optimizing Large Language Model Serving with Layer-wise KV Cache Management(@Ant Group)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.00428)|⚠️|⭐️⭐️ |\n|2024.10|🔥[**AdaKV**] Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference (@USTC)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.11550)|[[AdaKV]](https:\u002F\u002Fgithub.com\u002FFFY0\u002FAdaKV) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFFY0\u002FAdaKV.svg?style=social&label=Star)|⭐️⭐️|\n|2024.11|🔥[**KV Cache Recomputation**] Efficient LLM Inference with I\u002FO-Aware Partial KV Cache Recomputation(@University of Southern California)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.17089)|⚠️|⭐️⭐️ |\n|2024.12|🔥[**ClusterKV**] ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression(@sjtu)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.03213)|⚠️|⭐️⭐️ |\n|2024.12|🔥[**DynamicKV**] DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs(@xiabinzhou0625 etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.14838)|⚠️|⭐️⭐️ |\n|2025.02|🔥[**DynamicLLaVA**] [ICLR2025] Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification (@ECNU, Xiaohongshu)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.00876)|[[DynamicLLaVA]](https:\u002F\u002Fgithub.com\u002FOsilly\u002Fdynamic_llava) 
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOsilly\u002Fdynamic_llava.svg?style=social&label=Star)|⭐️⭐️|\n|2025.02|🔥[**CacheCraft**] Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation(@Adobe Research)|[[pdf]](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2502.15734)|⚠️|⭐️⭐️ |\n|2025.04|🔥[**KV Cache Prefetch**] Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching(@Alibaba)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.06319)|⚠️|⭐️⭐️ |\n|2025.05|🔥[**KVzip**] KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction (@SNU)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23416)|[[KVzip]](https:\u002F\u002Fgithub.com\u002Fsnu-mllab\u002FKVzip) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsnu-mllab\u002FKVzip.svg?style=social&label=Star)|⭐️⭐️|\n|2025.06|🔥🔥[**Inference-Time Hyper-Scaling**] Inference-Time Hyper-Scaling with KV Cache Compression (@NVIDIA)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.05345)|⚠️|⭐️⭐️ |\n|2026.03|[**AVP**] Agent Vector Protocol: Cross-Model KV-Cache Transfer via Vocabulary-Mediated Projection (@VectorArc)|[[spec]](https:\u002F\u002Fgithub.com\u002FVectorArc\u002Favp-spec)|[[avp-python]](https:\u002F\u002Fgithub.com\u002FVectorArc\u002Favp-python) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVectorArc\u002Favp-python.svg?style=social)|⭐️⭐️ |\n\n### 📖Prompt\u002FContext\u002FKV Compression ([©️back👆🏻](#paperlist))\n\u003Cdiv id=\"Context-Compression\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n|2023.04|🔥[**Selective-Context**] Compressing Context to Enhance Inference Efficiency of Large Language Models(@Surrey) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.06201.pdf)|[Selective-Context](https:\u002F\u002Fgithub.com\u002Fliyucheng09\u002FSelective_Context)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fliyucheng09\u002FSelective_Context.svg?style=social)|⭐️⭐️ |\n|2023.05|[**AutoCompressor**] Adapting Language Models to Compress Contexts(@Princeton) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.14788.pdf)|[AutoCompressor](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FAutoCompressors)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fprinceton-nlp\u002FAutoCompressors.svg?style=social)|⭐️ |\n|2023.10|🔥[**LLMLingua**] LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models(@Microsoft) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.05736.pdf)|[LLMLingua](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLMLingua)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FLLMLingua.svg?style=social)|⭐️⭐️ |\n|2023.10|🔥🔥[**LongLLMLingua**] LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression(@Microsoft) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06839)|[LLMLingua](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLMLingua)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FLLMLingua.svg?style=social)|⭐️⭐️ |\n|2024.03|🔥[**LLMLingua-2**] LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression(@Microsoft) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.12968.pdf)|[LLMLingua series](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLMLingua)  
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FLLMLingua.svg?style=social)|⭐️ |\n|2024.08|🔥🔥[**500xCompressor**] 500xCompressor: Generalized Prompt Compression for Large Language Models(@University of Cambridge) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.03094) | ⚠️ |⭐️⭐️ |\n|2024.08|🔥🔥[**Eigen Attention**] Eigen Attention: Attention in Low-Rank Space for KV Cache Compression(@purdue.edu) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.05646) | ⚠️ |⭐️⭐️ |\n|2024.09|🔥🔥[**Prompt Compression**] Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference(@Alterra AI)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.01227) | ⚠️ |⭐️⭐️ |\n|2024.09|🔥🔥[**Context Distillation**] Efficient LLM Context Distillation(@gatech.edu)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.01930) | ⚠️ |⭐️⭐️ |\n|2024.09|🔥🔥[**CRITIPREFILL**] CRITIPREFILL: A SEGMENT-WISE CRITICALITY-BASED APPROACH FOR PREFILLING ACCELERATION IN LLMS(@OPPO) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.12490) | [CritiPrefill](https:\u002F\u002Fgithub.com\u002F66RING\u002FCritiPrefill)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002F66RING\u002FCritiPrefill.svg?style=social)|⭐️ |\n|2024.10|🔥🔥[**KV-COMPRESS**] PAGED KV-CACHE COMPRESSION WITH VARIABLE COMPRESSION RATES PER ATTENTION HEAD(@Cloudflare, inc.)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.00161) | [vllm-kvcompress](https:\u002F\u002Fgithub.com\u002FIsaacRe\u002Fvllm-kvcompress) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIsaacRe\u002Fvllm-kvcompress.svg?style=social)|⭐️⭐️ |\n|2024.10|🔥🔥[**LORC**] Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy(@gatech.edu)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.03111)|⚠️ |⭐️⭐️ |\n|2025.11|🔥🔥[**KVTC**] KV Cache Transform Coding for Compact Storage in LLM Inference (@NVIDIA)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2511.01815)|⚠️|⭐️⭐️ |\n\n### 📖Long Context Attention\u002FKV Cache Optimization ([©️back👆🏻](#paperlist))\n\u003Cdiv id=\"Long-Context-Attention-KVCache\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n|2023.05|🔥🔥[**Blockwise Attention**] Blockwise Parallel Transformer for Large Context Models(@UC Berkeley)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.19370.pdf) | ⚠️ |⭐️⭐️ |\n|2023.05|🔥[Landmark Attention] Random-Access Infinite Context Length for Transformers(@epfl.ch)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.16300.pdf)|[landmark-attention](https:\u002F\u002Fgithub.com\u002Fepfml\u002Flandmark-attention\u002F)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fepfml\u002Flandmark-attention.svg?style=social)|⭐️⭐️ |\n|2023.07|🔥[**LightningAttention-1**] TRANSNORMERLLM: A FASTER AND BETTER LARGE LANGUAGE MODEL WITH IMPROVED TRANSNORMER(@OpenNLPLab)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.14995.pdf)|[TransnormerLLM](https:\u002F\u002Fgithub.com\u002FOpenNLPLab\u002FTransnormerLLM)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenNLPLab\u002FTransnormerLLM.svg?style=social)|⭐️⭐️ |\n|2023.07|🔥[**LightningAttention-2**] Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models(@OpenNLPLab)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.04658.pdf)|[lightning-attention](https:\u002F\u002Fgithub.com\u002FOpenNLPLab\u002Flightning-attention)  
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenNLPLab\u002Flightning-attention.svg?style=social)|⭐️⭐️ |\n|2023.10|🔥🔥[**RingAttention**] Ring Attention with Blockwise Transformers for Near-Infinite Context(@UC Berkeley)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.01889.pdf)| [[RingAttention]](https:\u002F\u002Fgithub.com\u002Flhao499\u002FRingAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flhao499\u002FRingAttention.svg?style=social)|⭐️⭐️ |\n|2023.11|🔥[**HyperAttention**] HyperAttention: Long-context Attention in Near-Linear Time(@yale&Google)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.05869.pdf)|[hyper-attn](https:\u002F\u002Fgithub.com\u002Finsuhan\u002Fhyper-attn)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Finsuhan\u002Fhyper-attn.svg?style=social)|⭐️⭐️ |\n|2023.11|[**Streaming Attention**] One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space(@Adobe Research etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.14652.pdf)|⚠️ |⭐️ |\n|2023.11|🔥[**Prompt Cache**] PROMPT CACHE: MODULAR ATTENTION REUSE FOR LOW-LATENCY INFERENCE(@Yale University etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.04934.pdf)|⚠️|⭐️⭐️ |\n|2023.11|🔥🔥[**StripedAttention**] STRIPED ATTENTION: FASTER RING ATTENTION FOR CAUSAL TRANSFORMERS(@MIT etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.09431.pdf) |[[striped_attention]](https:\u002F\u002Fgithub.com\u002Fexists-forall\u002Fstriped_attention\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fexists-forall\u002Fstriped_attention.svg?style=social) |⭐️⭐️ |\n|2024.01|🔥🔥[**KVQuant**] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization(@UC Berkeley)|[[pdf]](https:\u002F\u002Fbrowse.arxiv.org\u002Fpdf\u002F2401.18079.pdf)|[[KVQuant]](https:\u002F\u002Fgithub.com\u002FSqueezeAILab\u002FKVQuant\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSqueezeAILab\u002FKVQuant.svg?style=social) |⭐️⭐️ |\n|2024.02|🔥[**RelayAttention**] RelayAttention for Efficient Large Language Model Serving with Long System Prompts(@sensetime.com etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.14808.pdf) | ⚠️ |⭐️⭐️ |\n|2024.04|🔥🔥[Infini-attention] Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention(@Google) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.07143.pdf) | ⚠️ |⭐️⭐️ |\n|2024.04|🔥🔥[RAGCache] RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation(@Peking University&ByteDance Inc) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.12457.pdf) | ⚠️ |⭐️⭐️ |\n|2024.04|🔥🔥[**KCache**] EFFICIENT LLM INFERENCE WITH KCACHE(@Qiaozhi He, Zhihua Wu)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.18057) | ⚠️ |⭐️⭐️ |\n|2024.04|🔥[**HOMER**] Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs(@KAIST)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.10308)|[[HOMER]](https:\u002F\u002Fgithub.com\u002Falinlab\u002FHOMER) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Falinlab\u002FHOMER?style=social) |⭐️⭐️ |\n|2024.05|🔥🔥[**YOCO**] You Only Cache Once: Decoder-Decoder Architectures for Language Models(@Microsoft)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.05254) | [[unilm-YOCO]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm\u002Ftree\u002Fmaster\u002FYOCO) 
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Funilm.svg?style=social) |⭐️⭐️ |\n|2024.05|🔥🔥[SKVQ] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models(@Shanghai AI Laboratory)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.06219) | ⚠️ |⭐️⭐️ |\n|2024.05|🔥🔥[**CLA**] Reducing Transformer Key-Value Cache Size with Cross-Layer Attention(@MIT-IBM)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.12981) | ⚠️ |⭐️⭐️ |\n|2024.06|🔥[LOOK-M] LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference(@osu.edu etc)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.18139) | [[LOOK-M]](https:\u002F\u002Fgithub.com\u002FSUSTechBruce\u002FLOOK-M) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSUSTechBruce\u002FLOOK-M.svg?style=social) |⭐️⭐️ |\n|2024.06|🔥🔥[**MInference**] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention(@Microsoft etc)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.02490) | [[MInference]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMInference) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FMInference.svg?style=social) |⭐️⭐️ |\n|2024.06|🔥🔥[**InfiniGen**] InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management(@snu) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.19707) | ⚠️ |⭐️⭐️ |\n|2024.06|🔥🔥[**Quest**] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference(@mit-han-lab etc) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.10774)| [[Quest]](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002FQuest) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmit-han-lab\u002FQuest.svg?style=social) |⭐️⭐️ |\n|2024.07|🔥[PQCache] PQCache: Product Quantization-based KVCache for Long Context LLM Inference(@PKU etc)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.12820) | ⚠️ |⭐️⭐️ |\n|2024.08|🔥[**SentenceVAE**] SentenceVAE: Faster, Longer and More Accurate Inference with Next-sentence Prediction for Large Language Models(@TeleAI)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.00655) | ⚠️ |⭐️⭐️ |\n|2024.09|🔥[**InstInfer**] InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference(@PKU etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.04992) | ⚠️ |⭐️⭐️ |\n|2024.09|🔥[**RetrievalAttention**] RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval(@microsoft.com)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.10516)|⚠️|⭐️⭐️ |\n|2024.10|🔥[**ShadowKV**] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference(@CMU & bytedance)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.21465)|[[ShadowKV]](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FShadowKV) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002FShadowKV.svg?style=social) |⭐️⭐️ |\n|2025.01|🔥🔥🔥 [**Lightning Attention**] MiniMax-01: Scaling Foundation Models with Lightning Attention | [[report]](https:\u002F\u002Ffilecdn.minimax.chat\u002F_Arxiv_MiniMax_01_Report.pdf) | [[MiniMax-01]](https:\u002F\u002Fgithub.com\u002FMiniMax-AI\u002FMiniMax-01) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMiniMax-AI\u002FMiniMax-01.svg?style=social) | ⭐️⭐️ |\n|2025.06|🔥[**REFORM**] Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers(@KAIST & Amazon 
etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.01215)|⚠️|⭐️⭐️ |\n\n### 📖Early-Exit\u002FIntermediate Layer Decoding ([©️back👆🏻](#paperlist))\n\u003Cdiv id=\"Early-Exit\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n|2020.04|[DeeBERT] DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference(@uwaterloo.ca)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2004.12993.pdf)|⚠️|⭐️ |\n|2020.04|[FastBERT] FastBERT: a Self-distilling BERT with Adaptive Inference Time(@PKU)|[[pdf]](https:\u002F\u002Faclanthology.org\u002F2020.acl-main.537.pdf)|[[FastBERT]](https:\u002F\u002Fgithub.com\u002Fautoliuweijie\u002FFastBERT) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fautoliuweijie\u002FFastBERT.svg?style=social)|⭐️ |\n|2021.06|[BERxiT] BERxiT: Early Exiting for BERT with Better Fine-Tuning and Extension to Regression(@uwaterloo.ca)|[[pdf]](https:\u002F\u002Faclanthology.org\u002F2021.eacl-main.8.pdf)|[[berxit]](https:\u002F\u002Fgithub.com\u002Fcastorini\u002Fberxit) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcastorini\u002Fberxit.svg?style=social)|⭐️ |\n|2023.06|🔥[**SkipDecode**] SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference(@Microsoft) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.02628) |⚠️|⭐️ |\n|2023.10|🔥[**LITE**] Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE(@Arizona State University) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.18581v2.pdf)|⚠️|⭐️⭐️ |\n|2023.12|🔥🔥[**EE-LLM**] EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism(@alibaba-inc.com) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.04916.pdf)| [[EE-LLM]](https:\u002F\u002Fgithub.com\u002Fpan-x-c\u002FEE-LLM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpan-x-c\u002FEE-LLM.svg?style=social) |⭐️⭐️ |\n|2023.10|🔥[**FREE**] Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding(@KAIST AI&AWS AI)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.05424.pdf)| [[fast_robust_early_exit]](https:\u002F\u002Fgithub.com\u002Fraymin0223\u002Ffast_robust_early_exit) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fraymin0223\u002Ffast_robust_early_exit.svg?style=social) |⭐️⭐️ |\n|2024.02|🔥[**EE-Tuning**] EE-Tuning: An Economical yet Scalable Solution for Tuning Early-Exit Large Language Models(@alibaba-inc.com)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.00518)| [[EE-Tuning]](https:\u002F\u002Fgithub.com\u002Fpan-x-c\u002FEE-LLM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpan-x-c\u002FEE-LLM.svg?style=social) |⭐️⭐️ |\n|2024.07| [Skip Attention] Attention Is All You Need But You Don’t Need All Of It For Inference of Large Language Models(@University College London)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.15516)|⚠️|⭐️⭐️ |\n|2024.08| [**KOALA**] KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning(@Dalian University)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.08146)|⚠️|⭐️⭐️ |\n\n### 📖Parallel Decoding\u002FSampling ([©️back👆🏻](#paperlist))\n\u003Cdiv id=\"Parallel-Decoding-Sampling\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n|2018.11|🔥[**Parallel Decoding**] Blockwise Parallel Decoding for Deep Autoregressive Models(@Berkeley&Google)| 
[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1811.03115.pdf)|⚠️ |⭐️⭐️ |\n|2023.02|🔥[**Speculative Sampling**] Accelerating Large Language Model Decoding with Speculative Sampling(@DeepMind)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01318.pdf)| ⚠️ |⭐️⭐️ |\n|2023.05|🔥[**Speculative Sampling**] Fast Inference from Transformers via Speculative Decoding(@Google Research etc) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.17192.pdf)| [[LLMSpeculativeSampling]](https:\u002F\u002Fgithub.com\u002Ffeifeibear\u002FLLMSpeculativeSampling) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffeifeibear\u002FLLMSpeculativeSampling.svg?style=social) |⭐️⭐️ |\n|2023.09|🔥[**Medusa**] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads(@Tianle Cai etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.10774.pdf)|[[Medusa]](https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FMedusa) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFasterDecoding\u002FMedusa.svg?style=social)|⭐️⭐️ |\n|2023.10|[**OSD**] Online Speculative Decoding(@UC Berkeley etc) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.07177.pdf)| ⚠️ |⭐️⭐️|\n|2023.12|[**Cascade Speculative**] Cascade Speculative Drafting for Even Faster LLM Inference(@illinois.edu) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.11462.pdf)| ⚠️ |⭐️|\n|2024.02|🔥[LookaheadDecoding] Break the Sequential Dependency of LLM Inference Using LOOKAHEAD DECODING(@UCSD&Google&UC Berkeley)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.02057.pdf)| [[LookaheadDecoding]](https:\u002F\u002Fgithub.com\u002Fhao-ai-lab\u002FLookaheadDecoding) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhao-ai-lab\u002FLookaheadDecoding.svg?style=social) |⭐️⭐️ |\n|2024.02|🔥🔥[**Speculative Decoding**] Decoding Speculative Decoding(@cs.wisc.edu)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.01528.pdf)| [Decoding Speculative Decoding](https:\u002F\u002Fgithub.com\u002Fuw-mad-dash\u002Fdecoding-speculative-decoding) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fuw-mad-dash\u002Fdecoding-speculative-decoding.svg?style=social) |⭐️|\n|2024.04|🔥🔥[**TriForce**] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding(@cmu.edu&Meta AI)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.11912) | [[TriForce]](https:\u002F\u002Fgithub.com\u002FInfini-AI-Lab\u002FTriForce) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInfini-AI-Lab\u002FTriForce.svg?style=social)|⭐️⭐️ |\n|2024.04|🔥🔥[**Hidden Transfer**] Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration(@pku.edu.cn etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.12022.pdf)| ⚠️ |⭐️|\n|2024.05|🔥[Instructive Decoding] INSTRUCTIVE DECODING: INSTRUCTION-TUNED LARGE LANGUAGE MODELS ARE SELF-REFINER FROM NOISY INSTRUCTIONS(@KAIST AI)|[[pdf]](https:\u002F\u002Fopenreview.net\u002Fpdf?id=LebzzClHYw)| [[Instructive-Decoding]](https:\u002F\u002Fgithub.com\u002Fjoonkeekim\u002FInstructive-Decoding) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjoonkeekim\u002FInstructive-Decoding.svg?style=social)|⭐️ |\n|2024.05|🔥[S3D] S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs(@lge.com)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.20314)| ⚠️ |⭐️|\n|2024.06|🔥[**Parallel Decoding**] Exploring and Improving Drafts in Blockwise Parallel 
Decoding(@KAIST&Google Research)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.09221)|⚠️ |⭐️⭐️ |\n|2024.07|🔥[Multi-Token Speculative Decoding] Multi-Token Joint Speculative Decoding for Accelerating Large Language Model Inference(@University of California, etc)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.09221)|⚠️ |⭐️⭐️ |\n|2024.08|🔥[Token Recycling] Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling(@ir.hit.edu.cn etc) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.08696)|⚠️ |⭐️⭐️ |\n|2024.08|🔥[**Speculative Decoding**] Parallel Speculative Decoding with Adaptive Draft Length(@USTC etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.11850)|[[PEARL]](https:\u002F\u002Fgithub.com\u002Fsmart-lty\u002FParallelSpeculativeDecoding) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsmart-lty\u002FParallelSpeculativeDecoding.svg?style=social) |⭐️⭐️ |\n|2024.08|🔥[**FocusLLM**] FocusLLM: Scaling LLM’s Context by Parallel Decoding(@Tsinghua University etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.11745)|[[FocusLLM]](https:\u002F\u002Fgithub.com\u002Fleezythu\u002FFocusLLM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fleezythu\u002FFocusLLM.svg?style=social)|⭐️ |\n|2024.08|🔥[**MagicDec**] MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding(@CMU etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.11049)|[[MagicDec]](https:\u002F\u002Fgithub.com\u002FInfini-AI-Lab\u002FMagicDec\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInfini-AI-Lab\u002FMagicDec.svg?style=social)|⭐️ |\n|2024.08|🔥[**Speculative Decoding**] Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation(@BIT) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.15562) | ⚠️ |⭐️⭐️ |\n|2024.09|🔥[**Hybrid Inference**] Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.13757) | ⚠️ |⭐️⭐️ |\n|2024.10|🔥[**PARALLELSPEC**] PARALLELSPEC: PARALLEL DRAFTER FOR EFFICIENT SPECULATIVE DECODING(@Tencent AI Lab etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.05589) | ⚠️ |⭐️⭐️ |\n|2024.10|🔥[**Fast Best-of-N**] Fast Best-of-N Decoding via Speculative Rejection(@CMU etc) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.20290) | ⚠️ |⭐️⭐️ |\n|2025.06|🔥[**Mamba Drafters**] Mamba Drafters for Speculative Decoding(@KAIST & Amazon etc) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.01206) | ⚠️ |⭐️⭐️ |\n|2025.06|🔥[**STAND**] Accelerated Test-Time Scaling with Model-Free Speculative Sampling(@KAIST & Amazon etc) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04708) | ⚠️ |⭐️⭐️ |\n\n### 📖Structured Prune\u002FKD\u002FWeight Sparse ([©️back👆🏻](#paperlist))\n\u003Cdiv id=\"Structured_Pruning_KD_Weight_Sparse\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n|2023.12|[**FLAP**] Fluctuation-based Adaptive Structured Pruning for Large Language Models(@Chinese Academy of Sciences etc)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.11983.pdf)| [[FLAP]](https:\u002F\u002Fgithub.com\u002FCASIA-IVA-Lab\u002FFLAP) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCASIA-IVA-Lab\u002FFLAP.svg?style=social)|⭐️⭐️ |\n|2023.12|🔥[**LASER**] The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank 
Reduction(@mit.edu)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.13558.pdf)| [[laser]](https:\u002F\u002Fgithub.com\u002Fpratyushasharma\u002Flaser) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpratyushasharma\u002Flaser.svg?style=social)|⭐️⭐️ |\n|2023.12|[PowerInfer] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU(@SJTU)|[[pdf]](https:\u002F\u002Fipads.se.sjtu.edu.cn\u002F_media\u002Fpublications\u002Fpowerinfer-20231219.pdf)|[[PowerInfer]](https:\u002F\u002Fgithub.com\u002FSJTU-IPADS\u002FPowerInfer) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSJTU-IPADS\u002FPowerInfer.svg?style=social)|⭐️ |\n|2024.01|[**Admm Pruning**] Fast and Optimal Weight Update for Pruned Large Language Models(@fmph.uniba.sk)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.02938.pdf)|[[admm-pruning]](https:\u002F\u002Fgithub.com\u002Ffmfi-compbio\u002Fadmm-pruning) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffmfi-compbio\u002Fadmm-pruning.svg?style=social)|⭐️ |\n|2024.01|[FFSplit] FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference(@Rice University etc) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.04044.pdf) |  ⚠️ |⭐️|\n|2025.03|🔥[**Simba**] Sparsified State-Space Models are Efficient Highway Networks(@KAIST)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.20698)|[[Simba]](https:\u002F\u002Fgithub.com\u002Fwoominsong\u002FSimba) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwoominsong\u002FSimba.svg?style=social)|⭐️ |\n|2025.06|[SDMPrune] SDMPrune: Self-Distillation MLP Pruning for Efficient Large Language Models (@CSU)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.11120) |[[SDMPrune]](https:\u002F\u002Fgithub.com\u002Fvisresearch\u002FSDMPrune)![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvisresearch\u002FSDMPrune.svg?style=social&label=Star)|⭐️⭐️ |\n|2026.03|[**HFPrune**] High-Fidelity Pruning for Large Language Models (@CSU)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.08083) |[[HFPrune]](https:\u002F\u002Fgithub.com\u002Fvisresearch\u002FHFPrune)![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvisresearch\u002FHFPrune.svg?style=social&label=Star)|⭐️⭐️ |\n\n### 📖Mixture-of-Experts(MoE) LLM Inference ([©️back👆🏻](#paperlist))\n\u003Cdiv id=\"Mixture_of_Experts_LLM_Inference\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n|2022.11|🔥[**WINT8\u002F4**] Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production(@NVIDIA&Microsoft) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10017.pdf)|[[FasterTransformer]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FFasterTransformer) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FFasterTransformer.svg?style=social)|⭐️⭐️ |\n|2023.12|🔥 [**Mixtral Offloading**] Fast Inference of Mixture-of-Experts Language Models with Offloading(@Moscow Institute of Physics and Technology etc)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.17238.pdf)| [[mixtral-offloading]](https:\u002F\u002Fgithub.com\u002Fdvmazur\u002Fmixtral-offloading) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvmazur\u002Fmixtral-offloading.svg?style=social)|⭐️⭐️ |\n|2024.01| [MoE-Mamba] MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts(@uw.edu.pl) |  
[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.04081.pdf)| ⚠️ |⭐️|\n|2024.04| [MoE Inference] Toward Inference-optimal Mixture-of-Expert Large Language Models(@UC San Diego etc)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.02852.pdf)| ⚠️ |⭐️|\n|2024.05| 🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model(@DeepSeek-AI)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.04434) | [[DeepSeek-V2]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-V2) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-V2.svg?style=social)| ⭐️⭐️ |\n|2024.06| [MoE] A Survey on Mixture of Experts(@HKU) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.06204)| ⚠️ |⭐️|\n\n\n### 📖CPU\u002FSingle GPU\u002FFPGA\u002FNPU\u002FMobile Inference ([©️back👆🏻](#paperlist))\n\u003Cdiv id=\"CPU-Single-GPU-Inference\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n|2023.03|[FlexGen] High-Throughput Generative Inference of Large Language Models  with a Single GPU(@Stanford University etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.06865.pdf)|[[FlexGen]](https:\u002F\u002Fgithub.com\u002FFMInference\u002FFlexGen) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFMInference\u002FFlexGen.svg?style=social)|⭐️ |\n|2023.11|[LLM CPU Inference] Efficient LLM Inference on CPUs(@intel)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.00502.pdf)| [[intel-extension-for-transformers]](https:\u002F\u002Fgithub.com\u002Fintel\u002Fintel-extension-for-transformers) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fintel\u002Fintel-extension-for-transformers.svg?style=social) |⭐️ |\n|2023.12|[LinguaLinked] LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices(@University of California Irvine)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.00388.pdf)|⚠️ |⭐️ |\n|2023.12|[OpenVINO] Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO(@Haim Barad etc) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.04951.pdf)|⚠️|⭐️ |\n|2024.03|[FlightLLM] FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs(@Infinigence-AI) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.03868.pdf)|⚠️|⭐️ |\n|2024.03|[Transformer-Lite] Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs(@OPPO) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2403\u002F2403.20041.pdf)|⚠️|⭐️ |\n|2024.07|🔥🔥[**xFasterTransformer**] Inference Performance Optimization for Large Language Models on CPUs(@Intel) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.07304)|[[xFasterTransformer]](https:\u002F\u002Fgithub.com\u002Fintel\u002FxFasterTransformer) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fintel\u002FxFasterTransformer.svg?style=social) |⭐️ |\n|2024.07| [Summary] Inference Optimization of Foundation Models on AI Accelerators(@AWS AI) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.09111)|⚠️|⭐️ |\n|2024.10| Large Language Model Performance Benchmarking on Mobile Platforms: A Thorough Evaluation(@SYSU) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.03613)|⚠️|⭐️ |\n|2024.10|🔥🔥[**FastAttention**] FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs for Efficient Inference(@huawei etc)| 
[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.16663)|⚠️|⭐️ |\n|2024.12|🔥🔥[**NITRO**] NITRO: LLM INFERENCE ON INTEL® LAPTOP NPUS(@cornell.edu)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.11053)|[[nitro]](https:\u002F\u002Fgithub.com\u002Fabdelfattah-lab\u002Fnitro) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fabdelfattah-lab\u002Fnitro.svg?style=social)|⭐️ |\n|2025.01|📱[**Off Grid**] On-device LLM + Vision + Image Gen for iOS & Android(@alichherawalla)|[[github]](https:\u002F\u002Fgithub.com\u002Falichherawalla\u002Foff-grid-mobile)|[[off-grid-mobile]](https:\u002F\u002Fgithub.com\u002Falichherawalla\u002Foff-grid-mobile) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Falichherawalla\u002Foff-grid-mobile.svg?style=social)|⭐️ |\n|2025.12|🔥[**Grail-V\u002FPSE**] Non-bijunctive Attention Collapse via POWER8 vec_perm — 8.8x CPU Inference Speedup(@Elyan Labs)|[[zenodo]](https:\u002F\u002Fdoi.org\u002F10.5281\u002Fzenodo.14862410)|[[ram-coffers]](https:\u002F\u002Fgithub.com\u002FScottcjn\u002Fram-coffers) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FScottcjn\u002Fram-coffers.svg?style=social)|⭐️ |\n|2025.12|🔥[**llama-cpp-power8**] POWER8 optimizations for llama.cpp: vec_perm non-bijunctive collapse, IBM MASS integration, dcbt resident prefetch. 8.8x speedup over stock(@Scottcjn)|[[github]](https:\u002F\u002Fgithub.com\u002FScottcjn\u002Fllama-cpp-power8)|[[llama-cpp-power8]](https:\u002F\u002Fgithub.com\u002FScottcjn\u002Fllama-cpp-power8) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FScottcjn\u002Fllama-cpp-power8.svg?style=social)|⭐️ |\n|2025.12|🔥[**RAM Coffers**] NUMA-aware weight banking for LLM inference. Maps brain hemisphere cognitive functions to NUMA topology for intelligent routing and selective prefetch(@Scottcjn)|[[github]](https:\u002F\u002Fgithub.com\u002FScottcjn\u002Fram-coffers)|[[ram-coffers]](https:\u002F\u002Fgithub.com\u002FScottcjn\u002Fram-coffers) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FScottcjn\u002Fram-coffers.svg?style=social)|⭐️ |\n\n### 📖Non Transformer Architecture ([©️back👆🏻](#paperlist))\n\u003Cdiv id=\"Non-Transformer-Architecture\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n|2023.05|🔥🔥[**RWKV**] RWKV: Reinventing RNNs for the Transformer Era(@Bo Peng etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.13048.pdf)|[[RWKV-LM]](https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBlinkDL\u002FRWKV-LM.svg?style=social)|⭐️⭐️ |\n|2023.12|🔥🔥[**Mamba**] Mamba: Linear-Time Sequence Modeling with Selective State Spaces(@cs.cmu.edu etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.00752.pdf)|[[mamba]](https:\u002F\u002Fgithub.com\u002Fstate-spaces\u002Fmamba) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fstate-spaces\u002Fmamba.svg?style=social)|⭐️⭐️ |\n|2024.06|🔥🔥[**RWKV-CLIP**] RWKV-CLIP: A Robust Vision-Language Representation Learner(@DeepGlint etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.06973)|[[RWKV-CLIP]](https:\u002F\u002Fgithub.com\u002Fdeepglint\u002FRWKV-CLIP) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepglint\u002FRWKV-CLIP.svg?style=social)|⭐️⭐️ |\n|2024.08|🔥🔥[Kraken] Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference(@Princeton) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.07802)|⚠️|⭐️ |\n|2024.08|🔥🔥[**FLA**] FLA: A 
Triton-Based Library for Hardware-Efficient Implementations of Linear Attention Mechanism(@sustcsonglin)| [[docs]](https:\u002F\u002Fgithub.com\u002Fsustcsonglin\u002Fflash-linear-attention) |[[flash-linear-attention]](https:\u002F\u002Fgithub.com\u002Fsustcsonglin\u002Fflash-linear-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsustcsonglin\u002Fflash-linear-attention.svg?style=social)|⭐️⭐️ |\n\n### 📖GEMM\u002FTensor Cores\u002FMMA\u002FParallel ([©️back👆🏻](#paperlist))\n\u003Cdiv id=\"GEMM-Tensor-Cores-WMMA\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n|2018.03|🔥🔥[Tensor Core] NVIDIA Tensor Core Programmability, Performance & Precision(@KTH Royal etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1803.04014.pdf)|⚠️|⭐️ |\n|2021.05|🔥[Intra-SM Parallelism] Exploiting Intra-SM Parallelism in GPUs via Persistent and Elastic Blocks(@sjtu.edu.cn)|[[pdf]](https:\u002F\u002Fmivenhan.github.io\u002Fpublication\u002F2021plasticine\u002F2021plasticine.pdf)|⚠️|⭐️ |\n|2022.06|[Microbenchmark] Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors(@tue.nl etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02874.pdf)|[[DissectingTensorCores]](https:\u002F\u002Fgithub.com\u002Fsunlex0717\u002FDissectingTensorCores) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsunlex0717\u002FDissectingTensorCores.svg?style=social)|⭐️ |\n|2022.09|🔥🔥[FP8] FP8 FORMATS FOR DEEP LEARNING(@NVIDIA) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05433.pdf)|⚠️|⭐️ |\n|2023.08|🔥[Tensor Cores] Reducing shared memory footprint to leverage high  throughput on Tensor Cores and its flexible API extension library(@Tokyo Institute etc) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.15152.pdf)|[[wmma_extension]](https:\u002F\u002Fgithub.com\u002Fwmmae\u002Fwmma_extension) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwmmae\u002Fwmma_extension.svg?style=social)|⭐️ |\n|2023.03|🔥🔥[**cutlass\u002Fcute**] Graphene: An IR for Optimized Tensor Computations on GPUs(@NVIDIA)|[[pdf]](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fpdf\u002F10.1145\u002F3582016.3582018)|[[cutlass]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002Fcutlass.svg?style=social)|⭐️ |\n|2024.02|[QUICK] QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference(@SqueezeBits Inc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.10076.pdf)|[[QUICK]](https:\u002F\u002Fgithub.com\u002FSqueezeBits\u002FQUICK) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSqueezeBits\u002FQUICK.svg?style=social)|⭐️⭐️ |\n|2024.02|[Tensor Parallel] TP-AWARE DEQUANTIZATION(@IBM T.J. 
Watson Research Center)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.04925.pdf)|⚠️|⭐️ |\n|2024.07|🔥🔥[**flute**] Fast Matrix Multiplications for Lookup Table-Quantized LLMs(@mit.edu etc) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.10960)|[[flute]](https:\u002F\u002Fgithub.com\u002FHanGuo97\u002Fflute) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHanGuo97\u002Fflute.svg?style=social)|⭐️⭐️ |\n|2024.08|🔥🔥[**LUT TENSOR CORE**] LUT TENSOR CORE: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration(@SJTU&PKU etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.06003)|⚠️|⭐️ |\n|2024.08|🔥🔥[**MARLIN**] MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models(@ISTA) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.11743)|[[marlin]](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fmarlin) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIST-DASLab\u002Fmarlin.svg?style=social)|⭐️⭐️ |\n|2024.08|🔥🔥[**SpMM**] High Performance Unstructured SpMM Computation Using Tensor Cores(@ETH Zurich)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.11551)|⚠️|⭐️ |\n|2024.09|🔥🔥[**TEE**]Confidential Computing on nVIDIA H100 GPU: A Performance Benchmark Study(@phala.network)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.03992)|⚠️|⭐️ |\n|2024.09|🔥🔥[**HiFloat8**] Ascend HiFloat8 Format for Deep Learning(@Huawei)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.16626)|⚠️|⭐️ |\n|2024.09|🔥🔥[**Tensor Cores**] Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores(@nju.edu.cn)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.17870)|⚠️|⭐️ |\n|2024.07|🔥🔥[**Tensor Product**] Acceleration of Tensor-Product Operations with Tensor Cores(@Heidelberg University)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.09621)|⚠️|⭐️ |\n|2024.12| 🔥🔥[**HADACORE**] HADACORE: TENSOR CORE ACCELERATED HADAMARD TRANSFORM KERNEL(@Meta)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.09621)|[[hadamard_transform]](https:\u002F\u002Fgithub.com\u002Fpytorch-labs\u002Fapplied-ai\u002Ftree\u002Fmain\u002Fkernels\u002Fcuda\u002Finference\u002Fhadamard_transform) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpytorch-labs\u002Fapplied-ai.svg?style=social)|⭐️ |\n|2024.10| 🔥🔥[**FLASH-ATTENTION RNG**] Reducing the Cost of Dropout in Flash-Attention by Hiding RNG with GEMM(@Princeton University)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.07531)|⚠️|⭐️ |\n|2025.02| 🔥🔥[**TRITONBENCH**] TRITONBENCH: Benchmarking Large Language Model Capabilities for Generating Triton Operators(@thunlp) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.14752) | [[TritonBench]](https:\u002F\u002Fgithub.com\u002Fthunlp\u002FTritonBench) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthunlp\u002FTritonBench.svg?style=social)|⭐️⭐️ |\n|2025.04| 🔥🔥[**Triton-distributed**] TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives(@ByteDance-Seed) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.20313) | [[Triton-distributed]](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FByteDance-Seed\u002FTriton-distributed.svg?style=social)|⭐️⭐️ |\n\n### 📖VLM\u002FPosition Embed\u002FOthers ([©️back👆🏻](#paperlist))\n\u003Cdiv 
id=\"Others\">\u003C\u002Fdiv>\n\n|Date|Title|Paper|Code|Recom|\n|:---:|:---:|:---:|:---:|:---:|\n|2021.04|🔥[RoPE] ROFORMER: ENHANCED TRANSFORMER WITH ROTARY  POSITION EMBEDDING(@Zhuiyi Technology Co., Ltd.) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2104.09864.pdf)|[[transformers]](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmodel_doc\u002Froformer) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhuggingface\u002Ftransformers.svg?style=social)|⭐️ |\n|2022.10|[ByteTransformer] A High-Performance Transformer Boosted for Variable-Length Inputs(@ByteDance&NVIDIA)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03052.pdf)|[[ByteTransformer]](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FByteTransformer) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002FByteTransformer.svg?style=social)|⭐️ |\n|2024.09|🔥[**Inf-MLLM**] Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU(@sjtu)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.09086)|⚠️|⭐️ |\n|2024.11|🔥[VL-CACHE] VL-CACHE: SPARSITY AND MODALITY-AWARE KV CACHE COMPRESSION FOR VISION-LANGUAGE MODEL INFERENCE ACCELERATION(@g.ucla.edu etc)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.23317)|⚠️|⭐️ |\n|2025.02| 🔥[**DynamicLLaVA**] [ICLR2025] Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification (@ECNU, Xiaohongshu)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.00876)|[[DynamicLLaVA]](https:\u002F\u002Fgithub.com\u002FOsilly\u002Fdynamic_llava) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOsilly\u002Fdynamic_llava.svg?style=social&label=Star)|⭐️⭐️|\n\n## ©️License\n\nGNU General Public License v3.0\n\n## 🎉Contribute\n\nWelcome to star & submit a PR to this repo!\n\n\u003Cdiv align='center'>\n  \u003Cimg width=\"450\" height=\"250\" alt=\"v02\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fxlite-dev_Awesome-LLM-Inference_readme_e21369547821.png\">\n\u003Ca href=\"https:\u002F\u002Fstar-history.com\u002F#xlite-dev\u002FAwesome-LLM-Inference&Date\">\n  \u003Cpicture align='center'>\n    \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fxlite-dev_Awesome-LLM-Inference_readme_076fb9e294ec.png&theme=dark\" \u002F>\n    \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fxlite-dev_Awesome-LLM-Inference_readme_076fb9e294ec.png\" \u002F>\n    \u003Cimg width=\"350\" height=\"250\" alt=\"Star History Chart\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fxlite-dev_Awesome-LLM-Inference_readme_076fb9e294ec.png\" \u002F>\n  \u003C\u002Fpicture>\n\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n---\n\n### Part of the Elyan Labs Ecosystem\n\n- [BoTTube](https:\u002F\u002Fbottube.ai) — AI video platform where 119+ agents create content\n- [RustChain](https:\u002F\u002Frustchain.org) — Proof-of-Antiquity blockchain with hardware attestation\n- [GitHub](https:\u002F\u002Fgithub.com\u002FScottcjn)\n","\u003Cdiv align='center'>\n  \u003Cimg src=https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Ffcd83ff2-7ace-4fb5-8d3b-3ccbc1ecbf87 width=250px >\n\u003C\u002Fdiv>\n\n\u003Cdiv align='center'>\n  \u003Cimg src=https:\u002F\u002Fcdn.rawgit.com\u002Fsindresorhus\u002Fawesome\u002Fd7305f38d29fed78fa85652e3a63e154dd8e8829\u002Fmedia\u002Fbadge.svg >\n  \u003Cimg 
src=https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fdownloads\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Ftotal?color=ccf&label=downloads&logo=github&logoColor=lightgrey >\n  \u003Cimg src=https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fxlite-dev\u002FAwesome-LLM-Inference.svg?style=social >\n  \u003Cimg src=https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FRelease-v2.6-brightgreen.svg >\n  \u003Cimg src=https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-GPLv3.0-turquoise.svg >\n \u003C\u002Fdiv>\n\n## 📒简介\nAwesome-LLM-Inference：精选的[📙包含代码的优秀大模型推理论文列表](#paperlist)。如需了解优秀的扩散模型推理相关资源，请查看📖[Awesome-DiT-Inference](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-DiT-Inference)！[](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fxlite-dev\u002FAwesome-DiT-Inference.svg?style=social)。若想学习CUDA相关知识，可查阅📖[LeetCUDA](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FLeetCUDA)！[](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fxlite-dev\u002FLeetCUDA.svg?style=social)。\n\n## 📖 新闻 🔥🔥\n\u003Cdiv id=\"news\">\u003C\u002Fdiv>\n\n- [2026年3月] Cache-DiT **[🎉v1.3.0](https:\u002F\u002Fgithub.com\u002Fvipshop\u002Fcache-dit)** 版本已发布，主要更新包括：支持[Ring](https:\u002F\u002Fcache-dit.readthedocs.io\u002Fen\u002Flatest\u002Fuser_guide\u002FCONTEXT_PARALLEL)注意力机制及[批处理P2P](https:\u002F\u002Fcache-dit.readthedocs.io\u002Fen\u002Flatest\u002Fuser_guide\u002FCONTEXT_PARALLEL)，[USP](https:\u002F\u002Fcache-dit.readthedocs.io\u002Fen\u002Flatest\u002Fuser_guide\u002FCONTEXT_PARALLEL\u002F)（环形与Ulysses混合架构），混合2D和3D并行计算（💥[USP + TP](https:\u002F\u002Fcache-dit.readthedocs.io\u002Fen\u002Flatest\u002Fuser_guide\u002FHYBRID_PARALLEL\u002F)），以及降低VAE-P通信开销。\n\n![arch](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fxlite-dev_Awesome-LLM-Inference_readme_de7e08501500.png)\n\n## ©️引用\n\n```BibTeX\n@misc{Awesome-LLM-Inference@2024,\n  title={Awesome-LLM-Inference：包含代码的优秀大模型推理论文精选},\n  url={https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference},\n  note={开源软件可在https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference获取},\n  author={xlite-dev, liyucheng09等},\n  year={2024}\n}\n```\n\n## 🎉 包含代码的大模型推理优秀论文\n\n[Awesome LLM Inference for Beginners.pdf](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Freleases\u002Fdownload\u002Fv0.3\u002FAwesome-LLM-Inference-v0.3.pdf.zip)：共500页，涵盖FastServe、FlashAttention 1\u002F2、FlexGen、FP8、LLM.int8()、PagedAttention、RoPE、SmoothQuant、WINT8\u002F4、连续批处理、ZeroQuant 1\u002F2\u002FFP、AWQ等内容。\n\n\u003Cdiv align='center'>\n\u003Cimg src=https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fassets\u002F31974251\u002F0ed77e9d-a1eb-4095-9a82-bad624964e55 >\n\u003C\u002Fdiv>\n\n## 🎉 下载所有PDF\n```bash\npython3 download_pdfs.py # 该代码由通义千问AI生成\n```\n\u003Cimg width=\"1267\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fxlite-dev_Awesome-LLM-Inference_readme_6775f92f179b.png\" \u002F>\n\n\u003Cdiv id=\"paperlist\">\u003C\u002Fdiv>\n\n## 📖 目录\n* 📖[热门大模型\u002FVLM主题](#Trending-LLM-VLM-Topics)🔥🔥🔥\n* 📖[DeepSeek\u002FMLA主题](#mla)🔥🔥🔥\n* 📖[多GPU\u002F多节点并行计算](#DP-MP-PP-TP-SP-CP)🔥🔥🔥\n* 📖[预填充与解码分离](#P-D-Disaggregating)🔥🔥🔥\n* 📖[大模型算法与评估综述](#LLM-Algorithmic-Eval-Survey)\n* 📖[大模型训练与推理框架设计](#LLM-Train-Inference-Framework)\n* 📖[权重与激活量化\u002F压缩](#Weight-Activation-Quantize-Compress)🔥\n* 📖[连续\u002F飞行中批处理](#Continuous-In-flight-Batching)\n* 📖[IO\u002FFLOPs感知型稀疏注意力](#IO-FLOPs-Aware-Attention-Sparse)🔥\n* 📖[KV缓存调度\u002F量化\u002F丢弃](#KV-Cache-Scheduling-Quantize-Dropping)🔥\n* 
📖[提示词\u002F上下文压缩](#Context-Compression)🔥\n* 📖[长上下文注意力\u002FKV缓存优化](#Long-Context-Attention-KVCache)🔥🔥\n* 📖[提前退出\u002F中间层解码](#Early-Exit)\n* 📖[并行解码\u002F采样](#Parallel-Decoding-Sampling)🔥\n* 📖[结构化剪枝\u002FKD\u002F权重稀疏化](#Structured_Pruning_KD_Weight_Sparse)\n* 📖[专家混合(MoE)大模型推理](#Mixture_of_Experts_LLM_Inference)🔥\n* 📖[CPU\u002FNPU\u002FFPGA\u002F移动端推理](#CPU-Single-GPU-Inference)\n* 📖[非Transformer架构](#Non-Transformer-Architecture)🔥\n* 📖[GEMM\u002F张量核心\u002FWMMA\u002F并行计算](#GEMM-Tensor-Cores-WMMA)\n* 📖[VLM\u002F位置嵌入及其他](#Others)\n\n### 📖热门LLM\u002FVLM主题（[©️返回👆🏻](#paperlist)）\n\u003Cdiv id=\"Trending-LLM-VLM-Topics\">\u003C\u002Fdiv>\n\n|日期|标题|论文|代码|推荐|\n|:---:|:---:|:---:|:---:|:---:|\n|2024.04| 🔥🔥🔥[Open-Sora] Open-Sora：让高效视频制作惠及所有人(@hpcaitech)|[[文档]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FOpen-Sora\u002Fblob\u002Fmain\u002Fdocs\u002Fzh_CN\u002FREADME.md) | [[Open-Sora]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FOpen-Sora) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhpcaitech\u002FOpen-Sora.svg?style=social)| ⭐️⭐️ |\n|2024.04| 🔥🔥🔥[Open-Sora计划] Open-Sora计划：本项目旨在复现Sora（OpenAI的T2V模型）(@PKU)|[[报告]](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FOpen-Sora-Plan\u002Fblob\u002Fmain\u002Fdocs\u002FReport-v1.0.0.md) | [[Open-Sora-Plan]](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FOpen-Sora-Plan) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FOpen-Sora-Plan.svg?style=social)| ⭐️⭐️ |\n|2024.05| 🔥🔥🔥[DeepSeek-V2] DeepSeek-V2：一款强大、经济且高效的专家混合语言模型(@DeepSeek-AI)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.04434) | [[DeepSeek-V2]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-V2) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-V2.svg?style=social)| ⭐️⭐️ |\n|2024.05|🔥🔥[YOCO] 你只需缓存一次：用于语言模型的解码器-解码器架构(@Microsoft)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.05254) | [[unilm-YOCO]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm\u002Ftree\u002Fmaster\u002FYOCO) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Funilm.svg?style=social) |⭐️⭐️ |\n|2024.06|🔥[**Mooncake**] Mooncake：一种以KV缓存为中心的解耦架构，用于LLM服务(@Moonshot AI) |[[pdf]](https:\u002F\u002Fgithub.com\u002Fkvcache-ai\u002FMooncake\u002Fblob\u002Fmain\u002FMooncake-v3.pdf) | [[Mooncake]](https:\u002F\u002Fgithub.com\u002Fkvcache-ai\u002FMooncake) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkvcache-ai\u002FMooncake.svg?style=social)|⭐️⭐️ |\n|2024.07|🔥🔥[**FlashAttention-3**] FlashAttention-3：通过异步和低精度实现快速而准确的注意力机制(@TriDao等) |[[pdf]](https:\u002F\u002Ftridao.me\u002Fpublications\u002Fflash3\u002Fflash3.pdf)|[[flash-attention]](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDao-AILab\u002Fflash-attention.svg?style=social)|⭐️⭐️ |\n|2024.07|🔥🔥[**MInference 1.0**] MInference 1.0：通过动态稀疏注意力加速长上下文LLM的预填充(@Microsoft) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.02490)|[[MInference 1.0]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMInference) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FMInference.svg?style=social)|⭐️⭐️ |\n|2024.11|🔥🔥🔥[**Star-Attention：11倍加速**] Star Attention：高效处理长序列的LLM推理(@NVIDIA)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.17116)|[[Star-Attention]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FStar-Attention) 
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FStar-Attention.svg?style=social)|⭐️⭐️ |\n|2024.12|🔥🔥🔥[**DeepSeek-V3**] DeepSeek-V3技术报告(@deepseek-ai) | [[pdf]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-V3\u002Fblob\u002Fmain\u002FDeepSeek_V3.pdf) | [[DeepSeek-V3]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-V3) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-V3.svg?style=social) | ⭐️⭐️ |\n|2025.01|🔥🔥🔥 [**MiniMax-Text-01**] MiniMax-01：利用闪电注意力扩展基础模型 | [[报告]](https:\u002F\u002Ffilecdn.minimax.chat\u002F_Arxiv_MiniMax_01_Report.pdf) | [[MiniMax-01]](https:\u002F\u002Fgithub.com\u002FMiniMax-AI\u002FMiniMax-01) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMiniMax-AI\u002FMiniMax-01.svg?style=social) | ⭐️⭐️ |\n|2025.01|🔥🔥🔥[**DeepSeek-R1**] DeepSeek-R1技术报告(@deepseek-ai) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.12948v1) | [[DeepSeek-R1]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-R1) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-R1.svg?style=social) | ⭐️⭐️ |\n\n### 📖DeepSeek\u002F多头潜在注意力(MLA)（[©️返回👆🏻](#paperlist)）\n\u003Cdiv id=\"mla\">\u003C\u002Fdiv>\n\n|日期|标题|论文|代码|推荐|\n|:---:|:---:|:---:|:---:|:---:|\n|2024.05| 🔥🔥🔥[DeepSeek-V2] DeepSeek-V2：一款强大、经济且高效的专家混合语言模型(@DeepSeek-AI)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.04434) | [[DeepSeek-V2]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-V2) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-V2.svg?style=social)| ⭐️⭐️ |\n|2024.12|🔥🔥🔥[**DeepSeek-V3**] DeepSeek-V3技术报告(@deepseek-ai) | [[pdf]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-V3\u002Fblob\u002Fmain\u002FDeepSeek_V3.pdf) | [[DeepSeek-V3]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-V3) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-V3.svg?style=social) | ⭐️⭐️ |\n|2025.01|🔥🔥🔥[**DeepSeek-R1**] DeepSeek-R1技术报告(@deepseek-ai) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.12948v1) | [[DeepSeek-R1]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-R1) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-R1.svg?style=social) | ⭐️⭐️ |\n|2025.02|🔥🔥🔥[**DeepSeek-NSA**] 原生稀疏注意力：与硬件对齐且可原生训练的稀疏注意力(@deepseek-ai)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.11089)| ⚠️|⭐️⭐️ |\n|2025.02|🔥🔥🔥[**FlashMLA**] DeepSeek FlashMLA(@deepseek-ai)|⚠️| [[FlashMLA]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FFlashMLA) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FFlashMLA.svg?style=social) |⭐️⭐️ |\n|2025.02|🔥🔥🔥[**DualPipe**] DeepSeek DualPipe(@deepseek-ai)|⚠️| [[DualPipe]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDualPipe) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDualPipe.svg?style=social) |⭐️⭐️ |\n|2025.02|🔥🔥🔥[**DeepEP**] DeepSeek DeepEP(@deepseek-ai)|⚠️| [[DeepEP]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepEP) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepEP.svg?style=social) |⭐️⭐️ |\n|2025.02|🔥🔥🔥[**DeepGEMM**] DeepSeek DeepGEMM(@deepseek-ai)|⚠️| [[DeepGEMM]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepGEMM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepGEMM.svg?style=social) |⭐️⭐️ |\n|2025.02|🔥🔥🔥[**EPLB**] DeepSeek 
EPLB(@deepseek-ai)|⚠️| [[EPLB]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FEPLB) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FEPLB.svg?style=social) |⭐️⭐️ |\n|2025.02|🔥🔥🔥[**3FS**] DeepSeek 3FS(@deepseek-ai)|⚠️| [[3FS]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002F3FS) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002F3FS.svg?style=social) |⭐️⭐️ |\n|2025.03|🔥🔥🔥[**推理系统**] DeepSeek-V3 \u002F R1 推理系统概览 (@deepseek-ai) | [[博客]](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F27181462601) | ⚠️|⭐️⭐️ |\n|2025.02|🔥🔥[**MHA2MLA**] 通往经济型推理之路：在任何基于Transformer的LLM中启用DeepSeek的多头潜在注意力(@fudan.edu.cn)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.14837)| [[MHA2MLA]](https:\u002F\u002Fgithub.com\u002FJT-Ushio\u002FMHA2MLA) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FJT-Ushio\u002FMHA2MLA.svg?style=social) |⭐️⭐️ |\n|2025.02|🔥🔥[**TransMLA**] TransMLA：多头潜在注意力就是你需要的一切(@PKU)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.07864)|[[TransMLA]](https:\u002F\u002Fgithub.com\u002Ffxmeng\u002FTransMLA) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffxmeng\u002FTransMLA.svg?style=social) | ⭐️⭐️ |\n|2025.03|🔥🔥[**X-EcoMLA**] X-EcoMLA：将预训练的注意力升级为MLA，实现高效且极致的KV压缩(@AMD)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.11132) |⚠️|⭐️⭐️ |\n\n### 📖多GPU\u002F多节点并行（[©️返回👆🏻](#paperlist)）\n\u003Cdiv id=\"DP-MP-PP-TP-SP-CP\">\u003C\u002Fdiv>\n\n|日期|标题|论文|代码|推荐|\n|:---:|:---:|:---:|:---:|:---:|\n|2019.10|🔥🔥[**MP: ZeRO**] DeepSpeed-ZeRO：面向万亿参数模型训练的内存优化（@microsoft.com）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1910.02054)|  [[deepspeed]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FDeepSpeed.svg?style=social) |⭐️⭐️ |\n|2020.05|🔥🔥[**TP: Megatron-LM**] Megatron-LM：使用模型并行训练数十亿参数语言模型（@NVIDIA）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1909.08053.pdf)|[[Megatron-LM]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FMegatron-LM.svg?style=social)|⭐️⭐️ |\n|2022.05|🔥🔥[**SP: Megatron-LM**] Megatron-LM：降低大型Transformer模型中的激活重计算（@NVIDIA）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.05198)|[[Megatron-LM]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FMegatron-LM.svg?style=social)|⭐️⭐️ |\n|2023.05|🔥🔥[**SP: BPT**] 面向大上下文模型的分块并行Transformer（@UC Berkeley）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.19370)| [[RingAttention]](https:\u002F\u002Fgithub.com\u002Flhao499\u002FRingAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flhao499\u002FRingAttention.svg?style=social)|⭐️⭐️ |\n|2023.10|🔥🔥[**SP: Ring Attention**] 基于分块Transformer的环形注意力机制，实现近无限上下文处理（@UC Berkeley）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.01889.pdf)| [[RingAttention]](https:\u002F\u002Fgithub.com\u002Flhao499\u002FRingAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flhao499\u002FRingAttention.svg?style=social)|⭐️⭐️ |\n|2023.11|🔥🔥[**SP: STRIPED ATTENTION**] STRIPED ATTENTION：用于因果Transformer的更快环形注意力机制（@MIT等）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.09431.pdf) |[[striped_attention]](https:\u002F\u002Fgithub.com\u002Fexists-forall\u002Fstriped_attention\u002F) 
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fexists-forall\u002Fstriped_attention.svg?style=social) |⭐️⭐️ |\n|2023.10|🔥🔥[**SP: DEEPSPEED ULYSSES**] DEEPSPEED ULYSSES：支持极端长序列Transformer模型训练的系统优化（@microsoft.com）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.14509)|  [[deepspeed]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FDeepSpeed.svg?style=social) |⭐️⭐️ |\n|2024.03|🔥🔥[**CP: Megatron-LM**] Megatron-LM：上下文并行概述（@NVIDIA）|[[docs]](https:\u002F\u002Fdocs.nvidia.com\u002Fmegatron-core\u002Fdeveloper-guide\u002Flatest\u002Fapi-guide\u002Fcontext_parallel.html)|[[Megatron-LM]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FMegatron-LM.svg?style=social)|⭐️⭐️ |\n|2024.05|🔥🔥[**SP: Unified Sequence Parallel (USP)**] YunChang：一种用于长上下文LLM模型训练和推理的统一序列并行注意力机制（@Tencent）|[[pdf]]()|[[long-context-attention]](https:\u002F\u002Fgithub.com\u002Ffeifeibear\u002Flong-context-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffeifeibear\u002Flong-context-attention.svg?style=social)|⭐️⭐️ |\n|2024.11|🔥🔥[**CP: Meta**] 上下文并行技术用于可扩展的百万标记推理（@Meta Platforms, Inc）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.01783)| ⚠️|⭐️⭐️ |\n|2024.11|🔥🔥[**TP: Comm Compression**] 用于张量并行LLM推理的通信压缩技术（@recogni.com）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.09510)| ⚠️|⭐️⭐️ |\n|2024.11|🔥🔥🔥[**SP: Star-Attention, 11x~ speedup**] Star Attention：高效处理长序列的LLM推理（@NVIDIA）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.17116)|[[Star-Attention]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FStar-Attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FStar-Attention.svg?style=social)|⭐️⭐️ |\n|2024.12|🔥🔥[**SP: TokenRing**] TokenRing：通过双向通信实现无限上下文LLM的高效并行框架（@SJTU）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.20501)|[[token-ring]](https:\u002F\u002Fgithub.com\u002FACA-Lab-SJTU\u002Ftoken-ring) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FACA-Lab-SJTU\u002Ftoken-ring.svg?style=social)|⭐️⭐️ |\n|2025.05|🔥🔥[**FSDP 1\u002F2**] PyTorch FSDP：开始使用全分片数据并行（FSDP）（@pytorch）| [[docs]](https:\u002F\u002Fpytorch.org\u002Ftutorials\u002Fintermediate\u002FFSDP_tutorial.html#getting-started-with-fully-sharded-data-parallel-fsdp) | ⚠️ |⭐️⭐️ |\n\n\n### 📖解耦预填充与解码（[©️返回👆🏻](#paperlist)）\n\u003Cdiv id=\"P-D-Disaggregating\">\u003C\u002Fdiv>\n\n|日期|标题|论文|代码|推荐|\n|:---:|:---:|:---:|:---:|:---:|\n|2024.01|🔥🔥[**DistServe**] DistServe：为优化吞吐量而解耦预填充与解码的大语言模型服务架构（@PKU）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.09670)|[[DistServe]](https:\u002F\u002Fgithub.com\u002FLLMServe\u002FDistServe) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLLMServe\u002FDistServe.svg?style=social) |⭐️⭐️ |\n|2024.06|🔥🔥[**Mooncake**] Mooncake：以KV缓存为中心的解耦架构，用于LLM服务（@Moonshot AI）|[[pdf]](https:\u002F\u002Fgithub.com\u002Fkvcache-ai\u002FMooncake\u002Fblob\u002Fmain\u002FMooncake-v1.pdf) |[[Mooncake]](https:\u002F\u002Fgithub.com\u002Fkvcache-ai\u002FMooncake) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkvcache-ai\u002FMooncake.svg?style=social)|⭐️⭐️ |\n|2024.12|🔥🔥[**KVDirect**] KVDirect：字节跳动推出的分布式解耦LLM推理系统|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.14743)|⚠️|⭐️ |\n|2025.01|🔥🔥[**DeServe**] DESERVE：通过去中心化实现经济实惠的离线LLM推理（@Berkeley）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.14784)|⚠️|⭐️ 
|\n|2025.04|🔥🔥[**MegaScale-Infer**] MegaScale-Infer：利用解耦专家并行技术大规模提供混合专家模型服务（@ByteDance Seed）| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.02263) |⚠️|⭐️ |\n\n### 📖大语言模型算法与评估综述（[©️返回👆🏻](#paperlist)）\n\u003Cdiv id=\"LLM-Algorithmic-Eval-Survey\">\u003C\u002Fdiv>\n\n|日期|标题|论文|代码|推荐|\n|:---:|:---:|:---:|:---:|:---:|\n|2023.10|[评估] 评估大型语言模型：综合综述(@tju.edu.cn)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.19736.pdf)|[[Awesome-LLMs-Evaluation]](https:\u002F\u002Fgithub.com\u002Ftjunlp-lab\u002FAwesome-LLMs-Evaluation-Papers)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftjunlp-lab\u002FAwesome-LLMs-Evaluation-Papers.svg?style=social) |⭐️ |\n|2023.11|🔥[**运行时性能**] 解析大型语言模型训练、微调和推理的运行时性能(@hkust-gz.edu.cn) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.03687.pdf)|⚠️|⭐️⭐️ |\n|2023.11|[ChatGPT周年纪念] ChatGPT一周年：开源大型语言模型是否正在迎头赶上？(@e.ntu.edu.sg)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16989.pdf)|⚠️|⭐️ |\n|2023.12|[算法综述] 大型语言模型的效率谱：算法综述(@Microsoft) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.00678.pdf)|⚠️|⭐️ |\n|2023.12|[安全与隐私] 大型语言模型（LLM）安全与隐私综述：好的、坏的与丑陋的(@Drexel University)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02003.pdf)|⚠️|⭐️ |\n|2023.12|🔥[**LLMCompass**] 面向大型语言模型推理的硬件评估框架(@princeton.edu) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.03134.pdf)|⚠️|⭐️⭐️ |\n|2023.12|🔥[**高效LLM**] 高效大型语言模型：综述(@Ohio State University等) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.03863.pdf)|[[Efficient-LLMs-Survey]](https:\u002F\u002Fgithub.com\u002FAIoT-MLSys-Lab\u002FEfficient-LLMs-Survey)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIoT-MLSys-Lab\u002FEfficient-LLMs-Survey.svg?style=social) |⭐️⭐️ |\n|2023.12|[**服务综述**] 向高效的生成式大型语言模型服务迈进：从算法到系统的综述(@Carnegie Mellon University) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.15234.pdf)|⚠️|⭐️⭐️ |\n|2024.01|[理解LLM] 理解LLM：从训练到推理的全面概述(@Shaanxi Normal University等)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.02038.pdf) | ⚠️|⭐️⭐️ |\n|2024.02|[LLM-Viewer] LLM推理揭秘：综述与Roofline模型见解(@Zhihang Yuan等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.16363.pdf)|[[LLM-Viewer]](https:\u002F\u002Fgithub.com\u002Fhahnyuan\u002FLLM-Viewer)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhahnyuan\u002FLLM-Viewer.svg?style=social) |⭐️⭐️ |\n|2024.07|[**内部一致性与自我反馈**] 大型语言模型中的内部一致性与自我反馈：综述|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.14507)| [[ICSF-Survey]](https:\u002F\u002Fgithub.com\u002FIAAR-Shanghai\u002FICSFSurvey)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIAAR-Shanghai\u002FICSFSurvey.svg?style=social) | ⭐️⭐️ |\n|2024.09|[**低比特**] 低比特大型语言模型综述：基础、系统与算法(@Beihang等)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.16694) | ⚠️|⭐️⭐️ |\n|2024.10|[**LLM推理**] 大型语言模型推理加速：全面的硬件视角(@SJTU等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.04466) | ⚠️|⭐️⭐️ |\n\n### 📖大语言模型训练\u002F推理框架\u002F设计（[©️返回👆🏻](#paperlist)）\n\u003Cdiv id=\"LLM-Train-Inference-Framework\">\u003C\u002Fdiv>\n\n|日期|标题|论文|代码|推荐|\n|:---:|:---:|:---:|:---:|:---:|\n|2020.05|🔥[**Megatron-LM**] 使用模型并行训练数十亿参数的语言模型(@NVIDIA)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1909.08053.pdf)|[[Megatron-LM]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FMegatron-LM.svg?style=social)|⭐️⭐️ |\n|2023.03|[FlexGen] 单GPU实现大语言模型高吞吐生成式推理(@斯坦福大学等) 
|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.06865.pdf)|[[FlexGen]](https:\u002F\u002Fgithub.com\u002FFMInference\u002FFlexGen) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFMInference\u002FFlexGen.svg?style=social)|⭐️ |\n|2023.05|[SpecInfer] 通过推测性推理和标记树验证加速生成式大语言模型服务(@北京大学等) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.09781.pdf)|[[FlexFlow]](https:\u002F\u002Fgithub.com\u002Fflexflow\u002FFlexFlow\u002Ftree\u002Finference) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fflexflow\u002FFlexFlow.svg?style=social)|⭐️ |\n|2023.05|[FastServe] 面向大语言模型的快速分布式推理服务(@北京大学等) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.05920.pdf)|⚠️|⭐️ |\n|2023.09|🔥[**vLLM**] 基于分页注意力的大语言模型服务高效内存管理(@UC伯克利等) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.06180.pdf)|[[vllm]](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvllm-project\u002Fvllm.svg?style=social)|⭐️⭐️ |\n|2023.09|[StreamingLLM] 带注意力汇流的高效流式语言模型(@Meta AI等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.17453.pdf)|[[streaming-llm]](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fstreaming-llm) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmit-han-lab\u002Fstreaming-llm.svg?style=social)|⭐️ |\n|2023.09|[Medusa] Medusa：利用多解码头加速LLM生成的简单框架(@Tianle Cai等)|[[blog]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fmedusa-llm)|[[Medusa]](https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FMedusa) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFasterDecoding\u002FMedusa.svg?style=social)|⭐️ |\n|2023.10|🔥[**TensorRT-LLM**] NVIDIA TensorRT LLM(@NVIDIA) |[[docs]](https:\u002F\u002Fnvidia.github.io\u002FTensorRT-LLM\u002F)|[[TensorRT-LLM]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-LLM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FTensorRT-LLM.svg?style=social) |⭐️⭐️ |\n|2023.11|🔥[**DeepSpeed-FastGen 2x vLLM?**] DeepSpeed-FastGen：通过MII和DeepSpeed-Inference实现LLM高吞吐文本生成(@微软)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.08671.pdf) | [[deepspeed-fastgen]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FDeepSpeed.svg?style=social) |⭐️⭐️ |\n|2023.12|🔥🔥[**SGLang**] 使用SGLang高效编程大语言模型(@斯坦福大学等) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.07104)|[[sglang]](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsgl-project\u002Fsglang.svg?style=social) |⭐️⭐️ |\n|2023.12|🔥[**PETALS**] 通过互联网进行大语言模型的分布式推理与微调(@HSE大学等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.08361.pdf)|[[petals]](https:\u002F\u002Fgithub.com\u002Fbigscience-workshop\u002Fpetals) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbigscience-workshop\u002Fpetals.svg?style=social)|⭐️⭐️ |\n|2023.10|[LightSeq] LightSeq：面向长上下文Transformer分布式训练的序列级并行(@UC伯克利等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.03294.pdf)|[[LightSeq]](https:\u002F\u002Fgithub.com\u002FRulinShao\u002FLightSeq) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRulinShao\u002FLightSeq.svg?style=social)|⭐️ |\n|2023.12|[PowerInfer] PowerInfer：使用消费级GPU实现快速大语言模型服务(@上海交大)|[[pdf]](https:\u002F\u002Fipads.se.sjtu.edu.cn\u002F_media\u002Fpublications\u002Fpowerinfer-20231219.pdf)|[[PowerInfer]](https:\u002F\u002Fgithub.com\u002FSJTU-IPADS\u002FPowerInfer) 
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSJTU-IPADS\u002FPowerInfer.svg?style=social)|⭐️ |\n|2024.01|[inferflow] INFERFLOW：面向大语言模型的高效且高度可配置的推理引擎(@腾讯AI实验室)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.08294.pdf) | [[inferflow]](https:\u002F\u002Fgithub.com\u002Finferflow\u002Finferflow) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Finferflow\u002Finferflow.svg?style=social)|⭐️ |\n|2024.06|🔥[**Mooncake**] Mooncake：以KV缓存为中心的解耦架构，用于LLM服务(@Moonshot AI) |[[pdf]](https:\u002F\u002Fgithub.com\u002Fkvcache-ai\u002FMooncake\u002Fblob\u002Fmain\u002FMooncake-v1.pdf) | [[Mooncake]](https:\u002F\u002Fgithub.com\u002Fkvcache-ai\u002FMooncake) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkvcache-ai\u002FMooncake.svg?style=social)|⭐️⭐️ |\n|2023.06|🔥[**LMDeploy**] LMDeploy：一个用于压缩、部署和服务LLM的工具包(@InternLM) |[[docs]](https:\u002F\u002Flmdeploy.readthedocs.io\u002Fen\u002Flatest\u002F) | [[lmdeploy]](https:\u002F\u002Fgithub.com\u002FInternLM\u002Flmdeploy) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002Flmdeploy.svg?style=social)|⭐️⭐️ |\n|2023.05|🔥[**MLC-LLM**] 具有ML编译功能的通用LLM部署引擎(@mlc-ai) | [[docs]](https:\u002F\u002Fllm.mlc.ai\u002F) | [[mlc-llm]](https:\u002F\u002Fgithub.com\u002Fmlc-ai\u002Fmlc-llm) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmlc-ai\u002Fmlc-llm.svg?style=social)|⭐️⭐️ |\n|2023.08|🔥[**LightLLM**] LightLLM是一个基于Python的LLM（大语言模型）推理和服务框架(@ModelTC) | [[docs]](https:\u002F\u002Fgithub.com\u002FModelTC\u002Flightllm) | [[lightllm]](https:\u002F\u002Fgithub.com\u002FModelTC\u002Flightllm) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FModelTC\u002Flightllm.svg?style=social)|⭐️⭐️ |\n|2023.03|🔥[**llama.cpp**] llama.cpp：用纯C\u002FC++实现Meta的LLaMA模型（及其他模型）的推理(@ggerganov) |[[docs]](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp) | [[llama.cpp]](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fggerganov\u002Fllama.cpp.svg?style=social)|⭐️⭐️ |\n|2024.02|🔥[**flashinfer**] FlashInfer：LLM服务的内核库(@flashinfer-ai) |[[docs]](https:\u002F\u002Fflashinfer.ai\u002F2024\u002F02\u002F02\u002Fcascade-inference.html)|[[flashinfer]](https:\u002F\u002Fgithub.com\u002Fflashinfer-ai\u002Fflashinfer) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fflashinfer-ai\u002Fflashinfer.svg?style=social)|⭐️⭐️ |\n|2024.07|🔥[DynamoLLM] DynamoLLM：为性能与能效设计LLM推理集群(@微软Azure研究)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.00741)|⚠️|⭐️ |\n|2024.08|🔥[NanoFlow] NanoFlow：通过优化实现大语言模型服务的最佳吞吐量(@华盛顿大学)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.12757)|[[Nanoflow]](https:\u002F\u002Fgithub.com\u002Fefeslab\u002FNanoflow) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fefeslab\u002FNanoflow.svg?style=social)|⭐️⭐️ |\n|2024.08|🔥[**去中心化LLM**] 基于边缘网络和能量收集的去中心化LLM推理(@帕多瓦)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.15907)|⚠️|⭐️ |\n|2024.11| 🔥[**SparseInfer**] SparseInfer：无需训练即可预测激活稀疏性，从而实现快速LLM推理(@首尔大学等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.12692)|⚠️|⭐️ |\n|2025.04|🔥[prima.cpp] 
PRIMA.CPP：在低资源日常家庭集群上加速70B规模LLM推理(@MBZUAI等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.08791)|[[prima.cpp]](https:\u002F\u002Fgithub.com\u002FLizonghang\u002Fprima.cpp) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLizonghang\u002Fprima.cpp.svg?style=social)|⭐️|\n|2025.07|🔥[**siiRL**] DistFlow：一个完全分布式的强化学习框架，用于可扩展且高效的LLM后训练(@上海创新研究院)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.13833)|[[siiRL]](https:\u002F\u002Fgithub.com\u002Fsii-research\u002FsiiRL)\u003Cbr> ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsii-research\u002FsiiRL.svg?style=social)|⭐️⭐️ |\n\n### 📖连续\u002F运行时批处理  ([©️返回👆🏻](#paperlist))\n\u003Cdiv id=\"Continuous-In-flight-Batching\">\u003C\u002Fdiv>\n\n|日期|标题|论文|代码|推荐|\n|:---:|:---:|:---:|:---:|:---:|\n|2022.07|🔥[**连续批处理**] Orca：基于Transformer的生成模型分布式推理系统（首尔国立大学等） |[[pdf]](https:\u002F\u002Fwww.usenix.org\u002Fsystem\u002Ffiles\u002Fosdi22-yu.pdf)|⚠️|⭐️⭐️ |\n|2023.10|🔥[**运行时批处理**] NVIDIA TensorRT LLM 批处理管理器（NVIDIA） |[[docs]](https:\u002F\u002Fnvidia.github.io\u002FTensorRT-LLM\u002Fbatch_manager.html)|[[TensorRT-LLM]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-LLM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FTensorRT-LLM.svg?style=social) |⭐️⭐️ |\n|2023.11|🔥[**DeepSpeed-FastGen 2倍于vLLM？**] DeepSpeed-FastGen：通过MII和DeepSpeed-Inference实现高吞吐量的LLM文本生成（微软）| [[blog]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed\u002Ftree\u002Fmaster\u002Fblogs\u002Fdeepspeed-fastgen) | [[deepspeed-fastgen]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FDeepSpeed.svg?style=social) |⭐️⭐️ |\n|2023.11|[Splitwise] Splitwise：利用阶段拆分实现高效的生成式LLM推理（微软等）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.18677.pdf)|⚠️ |⭐️ |\n|2023.12|[SpotServe] SpotServe：在抢占式实例上服务生成式大型语言模型（cmu.edu等）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.15566.pdf)|[[SpotServe]](https:\u002F\u002Fgithub.com\u002FHsword\u002FSpotServe) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHsword\u002FSpotServe.svg?style=social)|⭐️ |\n|2023.10|[LightSeq] LightSeq：用于长上下文Transformer分布式训练的序列级并行计算（UC Berkeley等）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.03294.pdf)|[[LightSeq]](https:\u002F\u002Fgithub.com\u002FRulinShao\u002FLightSeq) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRulinShao\u002FLightSeq.svg?style=social)|⭐️ |\n|2024.05|🔥[vAttention] vAttention：无需PagedAttention即可实现LLM服务的动态内存管理（微软印度研究院）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.04437)|[[vAttention]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fvattention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Fvattention.svg?style=social)|⭐️⭐️ |\n|2024.07|🔥🔥[**vTensor**] vTensor：用于高效LLM服务的灵活虚拟张量管理（上海交通大学等）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.15309)|[[vTensor]](https:\u002F\u002Fgithub.com\u002Fintelligent-machine-learning\u002Fglake\u002Ftree\u002Fmaster\u002FGLakeServe) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fintelligent-machine-learning\u002Fglake.svg?style=social)|⭐️⭐️ |\n|2024.08|🔥[自动推理引擎调优] 通过自动推理引擎调优实现SLO优化的LLM服务（南京大学等）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.04323)|⚠️|⭐️⭐️ |\n|2024.08|🔥[**SJF调度**] 通过学习排序实现高效的LLM调度（UCSD等）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.15792)|⚠️|⭐️⭐️ |\n|2024.12|🔥[**BatchLLM**] 
BatchLLM：通过全局前缀共享和面向吞吐量的标记批处理优化大规模批量LLM推理（微软）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.03594)|⚠️|⭐️⭐️ |\n\n### 📖权重\u002F激活量化\u002F压缩 ([©️返回👆🏻](#paperlist))\n\u003Cdiv id=\"Weight-Activation-Quantize-Compress\">\u003C\u002Fdiv>\n\n|日期|标题|论文|代码|推荐|\n|:---:|:---:|:---:|:---:|:---:|\n|2022.06|🔥[**ZeroQuant**] 面向大规模Transformer的高效且经济的后训练量化(@微软) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01861.pdf)|[[DeepSpeed]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FDeepSpeed.svg?style=social)|⭐️⭐️ |\n|2022.08|[FP8-Quantization] FP8量化：指数的力量(@高通AI研究) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09225.pdf) | [[FP8-quantization]](https:\u002F\u002Fgithub.com\u002FQualcomm-AI-research\u002FFP8-quantization) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQualcomm-AI-research\u002FFP8-quantization.svg?style=social) |⭐️ |\n|2022.08|[LLM.int8()] 面向大规模Transformer的8位矩阵乘法(@Facebook AI Research等) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.07339.pdf)|[[bitsandbytes]](https:\u002F\u002Fgithub.com\u002Ftimdettmers\u002Fbitsandbytes) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftimdettmers\u002Fbitsandbytes.svg?style=social)|⭐️ |\n|2022.10|🔥[**GPTQ**] GPTQ：生成式预训练Transformer的精确后训练量化(@IST奥地利等) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.17323.pdf) |[[gptq]](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fgptq) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIST-DASLab\u002Fgptq.svg?style=social)|⭐️⭐️ |\n|2022.11|🔥[**WINT8\u002F4**] 谁说大象不能跑：将大规模MoE模型带入云端规模生产(@NVIDIA&微软) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10017.pdf)|[[FasterTransformer]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FFasterTransformer) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FFasterTransformer.svg?style=social)|⭐️⭐️ |\n|2022.11|🔥[**SmoothQuant**] 大型语言模型的精准高效后训练量化(@MIT等) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10438.pdf)|[[smoothquant]](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fsmoothquant) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmit-han-lab\u002Fsmoothquant.svg?style=social)|⭐️⭐️ |\n|2023.03|[ZeroQuant-V2] 从全面研究到低秩补偿，探索LLM中的后训练量化(@微软)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.08302.pdf)|[[DeepSpeed]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FDeepSpeed.svg?style=social)|⭐️ |\n|2023.06|🔥[**AWQ**] AWQ：面向LLM压缩与加速的激活感知权重量化(@MIT等)|[[pdf]](https:\u002F\u002Fbrowse.arxiv.org\u002Fpdf\u002F2306.00978.pdf)|[[llm-awq]](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fllm-awq) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmit-han-lab\u002Fllm-awq.svg?style=social)|⭐️⭐️ |\n|2023.06|[SpQR] SpQR：用于近无损LLM权重压缩的稀疏量化表示(@华盛顿大学等)|[[pdf]](https:\u002F\u002Fbrowse.arxiv.org\u002Fpdf\u002F2306.03078.pdf)|[[SpQR]](https:\u002F\u002Fgithub.com\u002FVahe1994\u002FSpQR) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVahe1994\u002FSpQR.svg?style=social)|⭐️ |\n|2023.06|[SqueezeLLM] SQUEEZELLM：稠密与稀疏量化(@berkeley.edu) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.07629.pdf) | [[SqueezeLLM]](https:\u002F\u002Fgithub.com\u002FSqueezeAILab\u002FSqueezeLLM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSqueezeAILab\u002FSqueezeLLM.svg?style=social) |⭐️ |\n|2023.07|[ZeroQuant-FP] 
使用浮点格式实现LLM后训练W4A8量化的飞跃(@微软)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.09782.pdf)|[[DeepSpeed]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FDeepSpeed.svg?style=social)|⭐️ |\n|2023.09|[KV缓存FP8 + WINT4] 关于LLM推理性能优化的探索(@HPC4AI) | [[博客]](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F653735572)|⚠️|⭐️ |\n|2023.10|[FP8-LM] FP8-LM：训练FP8大型语言模型(@微软等)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.18313.pdf)| [[MS-AMP]](https:\u002F\u002Fgithub.com\u002FAzure\u002FMS-AMP) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAzure\u002FMS-AMP.svg?style=social) |⭐️ |\n|2023.10|[LLM-Shearing] SHEARED LLAMA：通过结构化剪枝加速语言模型预训练(@cs.princeton.edu等)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.06694.pdf) | [[LLM-Shearing]](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FLLM-Shearing) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fprinceton-nlp\u002FLLM-Shearing.svg?style=social)  |⭐️ |\n|2023.10|[LLM-FP4] LLM-FP4：4位浮点量化Transformer(@ust.hk&meta等) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.16836.pdf) | [[LLM-FP4]](https:\u002F\u002Fgithub.com\u002Fnbasyl\u002FLLM-FP4) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fnbasyl\u002FLLM-FP4.svg?style=social) |⭐️ |\n|2023.11|[2-bit LLM] 在GPU上实现快速2位LLM：内存对齐、稀疏异常值与异步反量化(@上海交通大学等) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16442.pdf)|⚠️ |⭐️ |\n|2023.12|[**SmoothQuant+**] SmoothQuant+：面向LLM的精准高效4位后训练权重量化(@中兴通讯)  | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.03788.pdf) | [[smoothquantplus]](https:\u002F\u002Fgithub.com\u002FAdlik\u002Fsmoothquantplus) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAdlik\u002Fsmoothquantplus.svg?style=social) |⭐️ |\n|2023.11|[OdysseyLLM W4A8] 部署级LLM量化的速度之旅(@meituan.com)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.09550.pdf)|⚠️|⭐️ |\n|2023.12|🔥[**SparQ**] SPARQ注意力：带宽高效的LLM推理(@graphcore.ai)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.04985.pdf)|⚠️|⭐️⭐️ |\n|2023.12|[Agile-Quant] Agile-Quant：面向边缘端LLM更快速推理的激活引导量化(@东北大学&Oracle)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.05693.pdf)|⚠️|⭐️ |\n|2023.12|[CBQ] CBQ：面向大型语言模型的跨块量化(@ustc.edu.cn)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.07950.pdf)|⚠️|⭐️ |\n|2023.10|[QLLM] QLLM：面向大型语言模型的精准高效低比特量化(@ZIP Lab&SenseTime Research等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.08041.pdf)|⚠️|⭐️ |\n|2024.01|[FP6-LLM] FP6-LLM：以FP6为中心的算法-系统协同设计，高效服务大型语言模型(@微软等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.14112.pdf)|⚠️|⭐️ |\n|2024.05|🔥🔥[**W4A8KV4**] QServe：面向高效LLM服务的W4A8KV4量化与系统协同设计(@MIT&NVIDIA)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.04532)|[[qserve]](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fqserve) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmit-han-lab\u002Fqserve.svg?style=social) |⭐️⭐️ |\n|2024.05|🔥[SpinQuant] SpinQuant：带有学习旋转的LLM量化(@Meta)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.16406)|⚠️|⭐️ |\n|2024.05|🔥[I-LLM] I-LLM：面向全量化低比特大型语言模型的高效纯整数推理(@Houmo AI)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.17849)|⚠️|⭐️ |\n|2024.06|🔥[OutlierTune] OutlierTune：面向大型语言模型的高效逐通道量化(@北京大学)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.18832)|⚠️|⭐️ |\n|2024.06|🔥[GPTQT] GPTQT：为提升效率对大型语言模型进行两次量化(@zju)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.02891)|⚠️|⭐️ |\n|2024.08|🔥[ABQ-LLM] 
ABQ-LLM：面向大型语言模型的任意比特量化推理加速(@字节跳动)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.08554)|[[ABQ-LLM]](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FABQ-LLM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002FABQ-LLM.svg?style=social)|⭐️ |\n|2024.08|🔥[1-bit LLMs] 在1位LLM时代，要么做矩阵乘法，要么不做(@南卡罗来纳大学)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.11939)|⚠️|⭐️ |\n|2024.08|🔥[ACTIVATION SPARSITY] 大型语言模型中的免训练激活稀疏化(@MIT等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.14690)|[[TEAL]](https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FTEAL) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFasterDecoding\u002FTEAL.svg?style=social)|⭐️ |\n|2024.09|🔥[VPTQ] VPTQ：面向大型语言模型的极低比特向量后训练量化(@微软)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.17066)|[[VPTQ]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FVPTQ) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FVPTQ.svg?style=social)|⭐️ |\n|2024.11|🔥[BitNet] BitNet a4.8：面向1位LLM的4位激活(@微软)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.04965)|[[bitnet]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm\u002Ftree\u002Fmaster\u002Fbitnet) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Funilm.svg?style=social)|⭐️ |\n|2025.04|🔥[**BitNet v2**] BitNet v2：面向1位LLM的原生4位激活与哈达玛变换(@微软)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.18415)|[[bitnet]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm\u002Ftree\u002Fmaster\u002Fbitnet) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Funilm.svg?style=social)|⭐️ |\n|2025.05|🔥[**GuidedQuant**] GuidedQuant：利用末端损失指导的大语言模型量化(@SNU&SamsungAILab&Google) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.07004) |[[GuidedQuant]](https:\u002F\u002Fgithub.com\u002Fsnu-mllab\u002FGuidedQuant) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsnu-mllab\u002FGuidedQuant.svg?style=social)|⭐️⭐️ |\n\n### 📖输入\u002F浮点运算次数感知的稀疏注意力（[©️返回👆🏻](#paperlist)）\n\u003Cdiv id=\"IO-FLOPs-Aware-Attention-Sparse\">\u003C\u002Fdiv>\n\n|日期|标题|论文|代码|推荐|\n|:---:|:---:|:---:|:---:|:---:|\n|2018.05| [在线Softmax] 用于Softmax的在线归一化计算(@NVIDIA) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1805.02867.pdf)|⚠️|⭐️ |\n|2019.11|🔥[MQA] 快速Transformer解码：一个写头就够了(@Google) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1911.02150.pdf)|⚠️|⭐️⭐️ |\n|2020.10|[哈希注意力] REFORMER：高效的Transformer(@Google)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2001.04451.pdf)|[[reformer]](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Ftrax\u002Ftree\u002Fmaster\u002Ftrax\u002Fmodels\u002Freformer) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgoogle\u002Ftrax.svg?style=social)|⭐️⭐️ |\n|2022.05|🔥[**FlashAttention**] 具有IO感知的快速且内存高效的精确注意力(@斯坦福大学等) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14135.pdf)|[[flash-attention]](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDao-AILab\u002Fflash-attention.svg?style=social)|⭐️⭐️ |\n|2022.10|[在线Softmax] 自注意力并不需要O(n^2)的内存(@Google) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05682.pdf) | ⚠️ |⭐️ |\n|2023.05|[FlashAttention] 从在线Softmax到FlashAttention(@cs.washington.edu)|[[pdf]](https:\u002F\u002Fcourses.cs.washington.edu\u002Fcourses\u002Fcse599m\u002F23sp\u002Fnotes\u002Fflashattn.pdf)|⚠️|⭐️⭐️ |\n|2023.05|[FLOP, I\u002FO] 解析GPT推理中的批处理效应(@Lequn Chen) | 
[[blog]](https:\u002F\u002Fle.qun.ch\u002Fen\u002Fblog\u002F2023\u002F05\u002F13\u002Ftransformer-batching\u002F) | ⚠️ |⭐️ |\n|2023.05|🔥🔥[**GQA**] GQA：从多头检查点训练广义多查询Transformer模型(@Google) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.13245.pdf)|[[flaxformer]](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fflaxformer) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgoogle\u002Fflaxformer.svg?style=social) |⭐️⭐️ |\n|2023.06|[稀疏FlashAttention] 通过稀疏FlashAttention加速大序列上的因果注意力(@EPFL等) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.01160.pdf) | [[dynamic-sparse-flash-attention]](https:\u002F\u002Fgithub.com\u002Fepfml\u002Fdynamic-sparse-flash-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fepfml\u002Fdynamic-sparse-flash-attention.svg?style=social)|⭐️ |\n|2023.07|🔥[**FlashAttention-2**] 更好的并行性和工作划分带来的更快注意力(@斯坦福大学等) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.08691.pdf)|[[flash-attention]](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDao-AILab\u002Fflash-attention.svg?style=social)|⭐️⭐️ |\n|2023.10|🔥[**Flash-Decoding**] 用于长上下文推理的Flash-Decoding(@斯坦福大学等)|[[blog]](https:\u002F\u002Fcrfm.stanford.edu\u002F2023\u002F10\u002F12\u002Fflashdecoding.html)|[[flash-attention]](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDao-AILab\u002Fflash-attention.svg?style=social)|⭐️⭐️ |\n|2023.11|[Flash-Decoding++] FLASHDECODING++：在GPU上更快速的大语言模型推理(@清华大学&Infinigence-AI) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.01282.pdf) | ⚠️ |⭐️ |\n|2023.01|[SparseGPT] SparseGPT：大规模语言模型可以一次性被准确剪枝(@ISTA等)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.00774.pdf)| [[sparsegpt]](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fsparsegpt) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIST-DASLab\u002Fsparsegpt.svg?style=social) |⭐️ |\n|2023.12|🔥[**GLA**] 具有硬件高效训练的门控线性注意力Transformer(@MIT-IBM Watson AI)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.06635.pdf)|[gated_linear_attention](https:\u002F\u002Fgithub.com\u002Fberlino\u002Fgated_linear_attention)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fberlino\u002Fgated_linear_attention.svg?style=social)|⭐️⭐️ |\n|2023.12|[SCCA] SCCA：用于长上下文语义扩展的移位跨块注意力(@北京航空航天大学)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.07305.pdf) | ⚠️ |⭐️ |\n|2023.12|🔥[**FlashLLM**] 闪存中的LLM：有限内存下的高效大语言模型推理(@Apple)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.11514.pdf) | ⚠️ |⭐️⭐️ |\n|2024.03|🔥🔥[CHAI] CHAI：用于高效LLM推理的聚类头部注意力(@cs.wisc.edu等)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.08058.pdf) | ⚠️ |⭐️⭐️ |\n|2024.04|🔥🔥[DeFT] DeFT：使用Flash树注意力进行高效树结构LLM推理(@西湖大学等)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.00242) | ⚠️ |⭐️⭐️ |\n|2024.04|[MoA] MoA：用于自动大语言模型压缩的稀疏注意力混合(@thu等.)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.14909) | [[MoA]](https:\u002F\u002Fgithub.com\u002Fthu-nics\u002FMoA) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-nics\u002FMoA.svg?style=social) | ⭐️ |\n|2024.07|🔥🔥[**FlashAttention-3**] FlashAttention-3：具有异步和低精度的快速且准确的注意力(@TriDao等)|[[pdf]](https:\u002F\u002Ftridao.me\u002Fpublications\u002Fflash3\u002Fflash3.pdf)|[[flash-attention]](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDao-AILab\u002Fflash-attention.svg?style=social)|⭐️⭐️ 
|\n|2024.07|🔥🔥[**MInference 1.0**] MInference 1.0：通过动态稀疏注意力加速长上下文LLM的预填充(@微软)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.02490)|[[MInference 1.0]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMInference) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FMInference.svg?style=social)|⭐️⭐️ |\n|2024.07|🔥🔥[共享注意力] 超越KV缓存：用于高效LLM的共享注意力(@九州大学等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.12866) | [[shareAtt]](https:\u002F\u002Fgithub.com\u002Fmetacarbon\u002FshareAtt) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmetacarbon\u002FshareAtt.svg?style=social) | ⭐️ |\n|2024.09|🔥🔥[**CHESS**] CHESS：通过通道级阈值和选择性稀疏化优化LLM推理(@武汉大学)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.01366) | ⚠️ |⭐️⭐️ |\n|2024.09|🔥🔥[INT-FLASHATTENTION] INT-FLASHATTENTION：为INT8量化启用Flash注意力(@PKU等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.16997)| [[INT-FlashAttention]](https:\u002F\u002Fgithub.com\u002FINT-FlashAttention2024\u002FINT-FlashAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FINT-FlashAttention2024\u002FINT-FlashAttention.svg?style=social) | ⭐️ |\n|2024.10|🔥🔥[**SageAttention**] SAGEATTENTION：用于即插即用推理加速的精确8位注意力(@thu-ml)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.02367)|[[SageAttention]](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002FSageAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-ml\u002FSageAttention) | ⭐️⭐️ |\n|2024.11|🔥🔥[**SageAttention-2**] SageAttention2：通过彻底的异常值平滑和线程级INT4量化实现高效注意力(@thu-ml)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.10958)|[[SageAttention]](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002FSageAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-ml\u002FSageAttention) | ⭐️⭐️ |\n|2024.11|🔥🔥[**Squeezed Attention**] 压缩注意力：加速长上下文LLM推理(@UC Berkeley) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.09688)|[[SqueezedAttention]](https:\u002F\u002Fgithub.com\u002FSqueezeAILab\u002FSqueezedAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSqueezeAILab\u002FSqueezedAttention) | ⭐️⭐️ |\n|2024.12|🔥🔥[**TurboAttention**] TURBOATTENTION：适用于高吞吐量LLM的高效注意力近似(@微软)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.08585)| ⚠️ |⭐️⭐️ |\n|2025.01|🔥🔥[**FFPA**] FFPA：另一种更快的Flash预填充注意力，对于head dim > 256，SRAM复杂度为O(1)，比SDPA EA快约1.5倍(@xlite-dev)|[[docs]](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn)| [[ffpa-attn]](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fxlite-dev\u002Fffpa-attn)|⭐️⭐️ |\n|2025.03|🔥🔥[**SpargeAttention**] SpargeAttn：准确的稀疏注意力，可加速任何模型的推理(@thu-ml)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.18137)|[[SpargeAttn]](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002FSpargeAttn) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-ml\u002FSpargeAttn) | ⭐️⭐️ |\n|2025.04|🔥🔥[**MMInference**] MMInference：通过模态感知的置换稀疏注意力加速长上下文视觉语言模型的预填充(@微软) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.16083)|[[MInference]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMInference\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FMInference) | ⭐️⭐️ |\n|2025.04|🔥🔥[**Sparse Frontier**] 稀疏前沿：Transformer LLM中的稀疏注意力权衡(@Cohere) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.17768)|[[SparseFrontier]](https:\u002F\u002Fgithub.com\u002FPiotrNawrot\u002Fsparse-frontier) 
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPiotrNawrot\u002Fsparse-frontier) | ⭐️⭐️ |\n|2024.12|🔥🔥[**Flex Attention**] FLEX ATTENTION：用于生成优化注意力核的编程模型(@pytorch) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.05496)|[[attention-gym]](https:\u002F\u002Fgithub.com\u002Fpytorch-labs\u002Fattention-gym) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpytorch-labs\u002Fattention-gym) | ⭐️⭐️ |\n|2025.02| 🔥🔥🔥[**SeerAttention**] SeerAttention：在您的LLM中学习内在的稀疏注意力(@微软) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13276) | [[SeerAttention]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FSeerAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FSeerAttention.svg?style=social) | ⭐️⭐️⭐️ |\n|2025.03| [**Slim attention**] Slim attention：在不损失精度的情况下将上下文内存减半，MHA只需要K-cache即可(@OpenMachine.ai) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.05840) | [[OpenMchine]](https:\u002F\u002Fgithub.com\u002FOpenMachine-ai\u002Ftransformer-tricks) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenMachine-ai\u002Ftransformer-tricks.svg?style=social) | ⭐️⭐️⭐️ |\n|2025.05|🔥🔥[**SageAttention-3**] SageAttention3：用于推理的微型FP4注意力以及8位训练的探索(@thu-ml)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.11594)|[[SageAttention]](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002FSageAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-ml\u002FSageAttention) | ⭐️⭐️ |\n|2025.04|🔥🔥[**Parallel Encoding**] APE：通过自适应并行编码实现更快、更长的上下文增强生成(@cmu.edu&NVIDIA)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.05431)|[[APE]](https:\u002F\u002Fgithub.com\u002FInfini-AI-Lab\u002FAPE) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInfini-AI-Lab\u002FAPE) | ⭐️⭐️ |\n|2025.04|🔥🔥[**Parallel Encoding**] 块注意力用于高效预填充(@腾讯等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.15355)|[[Block-attention]](https:\u002F\u002Fgithub.com\u002FTemporaryLoRA\u002FBlock-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTemporaryLoRA\u002FBlock-attention) | ⭐️⭐️ |\n\n### 📖KV缓存调度\u002F量化\u002F丢弃（[©️返回👆🏻](#paperlist)）\n\u003Cdiv id=\"KV-Cache-Scheduling-Quantize-Dropping\">\u003C\u002Fdiv>\n\n|日期|标题|论文|代码|推荐|\n|:---:|:---:|:---:|:---:|:---:|\n|2019.11|🔥[MQA] 快速Transformer解码：一个写头就够了(@Google) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1911.02150.pdf)|⚠️|⭐️⭐️ |\n|2022.06|[LTP] 针对Transformer的可学习令牌剪枝(@UC Berkeley等)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.00910.pdf)|[[LTP]](https:\u002F\u002Fgithub.com\u002Fkssteven418\u002FLTP) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkssteven418\u002FLTP.svg?style=social)|⭐️ |\n|2023.05|🔥🔥[**GQA**] GQA：从多头检查点训练广义多查询Transformer模型(@Google) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.13245.pdf)|[[flaxformer]](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fflaxformer) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgoogle\u002Fflaxformer.svg?style=social) |⭐️⭐️ |\n|2023.05|[KV缓存压缩] Scissorhands：利用重要性持久性假设在推理时压缩LLM的KV缓存(@)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.17118.pdf)|⚠️|⭐️⭐️ |\n|2023.06|[H2O] H2O：用于大型语言模型高效生成式推理的重击者Oracle(@Rice University等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.14048.pdf)|[[H2O]](https:\u002F\u002Fgithub.com\u002FFMInference\u002FH2O) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFMInference\u002FH2O.svg?style=social) |⭐️ |\n|2023.06|[QK稀疏\u002F丢弃注意力] 通过稀疏Flash注意力加速大序列上的因果注意力(@EPFL等) 
|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.01160.pdf) | [[dynamic-sparse-flash-attention]](https:\u002F\u002Fgithub.com\u002Fepfml\u002Fdynamic-sparse-flash-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fepfml\u002Fdynamic-sparse-flash-attention.svg?style=social)|⭐️ |\n|2023.08|🔥🔥[分块预填充] SARATHI：通过分块预填充捎带解码实现高效的LLM推理(@Microsoft等) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.16369.pdf)|⚠️|⭐️⭐️ |\n|2023.09|🔥🔥[**PagedAttention**] 使用PagedAttention实现大型语言模型服务中的高效内存管理(@UC Berkeley等) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.06180.pdf)|[[vllm]](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvllm-project\u002Fvllm.svg?style=social)|⭐️⭐️ |\n|2023.09|[KV缓存FP8 + WINT4] 关于LLM推理性能优化的探索(@HPC4AI) | [[博客]](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F653735572)|⚠️|⭐️ |\n|2023.10|🔥[**TensorRT-LLM KV缓存FP8**] NVIDIA TensorRT LLM(@NVIDIA) |[[文档]](https:\u002F\u002Fnvidia.github.io\u002FTensorRT-LLM\u002Fprecision.html)|[[TensorRT-LLM]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-LLM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FTensorRT-LLM.svg?style=social) |⭐️⭐️ |\n|2023.10|🔥[**自适应KV缓存压缩**] 模型告诉你该丢弃什么：针对LLMs的自适应KV缓存压缩(@illinois.edu&microsoft)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.01801.pdf)|⚠️|⭐️⭐️ |\n|2023.10|[CacheGen] CacheGen：面向语言模型应用的快速上下文加载(@Chicago University&Microsoft)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.07240.pdf)|[[LMCache]](https:\u002F\u002Fgithub.com\u002FLMCache\u002FLMCache) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLMCache\u002FLMCache.svg?style=social)|⭐️ |\n|2023.12|[KV缓存优化] 利用推测采样与KV缓存优化协同提升使用OpenVINO的生成式AI性能(@Haim Barad等) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.04951.pdf)|⚠️|⭐️ |\n|2023.12|[LoRA辅助的KV缓存压缩] 面向在线语言模型交互的压缩上下文记忆(@SNU & NAVER AI) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.03414.pdf)|[[Compressed-Context-Memory]](https:\u002F\u002Fgithub.com\u002Fsnu-mllab\u002FContext-Memory) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsnu-mllab\u002FContext-Memory.svg?style=social) |⭐️⭐️ |\n|2023.12|🔥🔥[**RadixAttention**] 使用SGLang高效编程大型语言模型(@Stanford University等) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.07104)|[[sglang]](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsgl-project\u002Fsglang.svg?style=social) |⭐️⭐️ |\n|2024.01|🔥🔥[**DistKV-LLM**] Infinite-LLM：借助DistAttention和分布式KV缓存实现长上下文的高效LLM服务(@Alibaba等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.02669.pdf)|⚠️|⭐️⭐️ |\n|2024.02|🔥🔥[提示缓存] 通过嵌入相似度实现高效的提示缓存(@UC Berkeley)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.01173.pdf)|⚠️|⭐️⭐️ |\n|2024.02|🔥🔥[Less] 以LESS获得更多：通过KV缓存压缩合成递归以实现高效的LLM推理(@CMU等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.09398.pdf)|⚠️|⭐️ |\n|2024.02|🔥🔥[MiKV] 不遗漏任何一个令牌：基于重要性感知的混合精度量化实现可靠的KV缓存压缩(@KAIST)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.18096.pdf)|⚠️|⭐️ |\n|2024.02|🔥🔥[**共享前缀**] Hydragen：利用共享前缀实现高吞吐量的LLM推理 | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.05099.pdf)|⚠️|⭐️⭐️ |\n|2024.02|🔥🔥[**ChunkAttention**] ChunkAttention：带有前缀感知KV缓存和两阶段分区的高效自注意力(@microsoft.com)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.15220)|[[chunk-attention]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fchunk-attention) 
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Fchunk-attention.svg?style=social) |⭐️⭐️ |\n|2024.03|🔥[QAQ] QAQ：适用于LLM KV缓存的质量适应性量化(@smail.nju.edu.cn)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.04643.pdf)|[[QAQ-KVCacheQuantization]](https:\u002F\u002Fgithub.com\u002FClubieDong\u002FQAQ-KVCacheQuantization) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FClubieDong\u002FQAQ-KVCacheQuantization.svg?style=social) |⭐️⭐️ |\n|2024.03|🔥🔥[DMC] 动态内存压缩：为LLMs加装加速推理的“外挂”(@NVIDIA等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.09636.pdf)|⚠️|⭐️⭐️ |\n|2024.03|🔥🔥[Keyformer] Keyformer：通过关键令牌选择减少KV缓存，实现高效生成式推理(@ece.ubc.ca等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.09054.pdf)|[[Keyformer]](https:\u002F\u002Fgithub.com\u002Fd-matrix-ai\u002Fkeyformer-llm) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fd-matrix-ai\u002Fkeyformer-llm.svg?style=social)|⭐️⭐️ |\n|2024.03|[FASTDECODE] FASTDECODE：利用异构资源实现高吞吐量且GPU高效的LLM服务(@Tsinghua University)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.11421.pdf)|⚠️|⭐️⭐️ |\n|2024.03|[稀疏感知KV缓存] ALISA：通过稀疏感知KV缓存加速大型语言模型推理(@ucf.edu)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.17312.pdf)|⚠️|⭐️⭐️ |\n|2024.03|🔥[GEAR] GEAR：近乎无损生成式推理的高效KV缓存压缩方案(@gatech.edu)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.05527)|[[GEAR]](https:\u002F\u002Fgithub.com\u002Fopengear-project\u002FGEAR) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopengear-project\u002FGEAR.svg?style=social)|⭐️ |\n|2024.04|[SqueezeAttention] SQUEEZEATTENTION：通过逐层最优预算实现LLM推理中KV缓存的二维管理(@lzu.edu.cn等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.04793.pdf)|[[SqueezeAttention]](https:\u002F\u002Fgithub.com\u002Fhetailang\u002FSqueezeAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhetailang\u002FSqueezeAttention.svg?style=social) |⭐️⭐️ |\n|2024.04|[SnapKV] SnapKV：LLM在生成之前就知道你在寻找什么(@UIUC)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.14469)|[[SnapKV]](https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FSnapKV) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFasterDecoding\u002FSnapKV.svg?style=social)|⭐️ |\n|2024.05|🔥[vAttention] vAttention：无需分页注意力即可为LLM服务提供动态内存管理(@Microsoft Research India)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.04437)|[[vAttention]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fvattention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Fvattention.svg?style=social)|⭐️⭐️ |\n|2024.05|🔥[KVCache-1Bit] KV缓存每通道仅1比特：通过联合量化实现高效的大型语言模型推理(@Rice University)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.03917)|⚠️|⭐️⭐️ |\n|2024.05|🔥[KV-Runahead] KV-Runahead：通过并行生成键值缓存实现可扩展的因果LLM推理(@Apple等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.05329)|⚠️|⭐️⭐️ |\n|2024.05|🔥[ZipCache] ZipCache：结合显著令牌识别实现精准高效的KV缓存量化(@Zhejiang University等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.14256)|⚠️|⭐️⭐️ |\n|2024.05|🔥[MiniCache] MiniCache：面向大型语言模型的深度维度KV缓存压缩(@ZIP Lab)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.14366)|⚠️|⭐️⭐️ |\n|2024.05|🔥[CacheBlend] CacheBlend：通过融合缓存知识实现快速大型语言模型服务(@University of Chicago)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.16444)|[[LMCache]](https:\u002F\u002Fgithub.com\u002FLMCache\u002FLMCache) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLMCache\u002FLMCache.svg?style=social)|⭐️⭐️ |\n|2024.06|🔥[CompressKV] 
有效压缩LLM的KV头(@alibaba等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.07056)|⚠️|⭐️⭐️ |\n|2024.06|🔥[MemServe] MemServe：采用弹性内存池实现去中心化LLM服务的上下文缓存(@Huawei Cloud等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.17565)|⚠️|⭐️⭐️ |\n|2024.07|🔥[MLKV] MLKV：用于内存高效Transformer解码的多层键值头(@Institut Teknologi Bandung)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.09297)|[[pythia-mlkv]](https:\u002F\u002Fgithub.com\u002Fzaydzuhri\u002Fpythia-mlkv) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzaydzuhri\u002Fpythia-mlkv.svg?style=social)|⭐️ |\n|2024.07|🔥[ThinK] ThinK：通过查询驱动的剪枝使键缓存更薄(@Salesforce AI Research等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.21018)|⚠️|⭐️⭐️ |\n|2024.07|🔥[Palu] Palu：利用低秩投影压缩KV缓存(@nycu.edu.tw)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.21118)|[[Palu]](https:\u002F\u002Fgithub.com\u002Fshadowpa0327\u002FPalu) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshadowpa0327\u002FPalu.svg?style=social)|⭐️⭐️ |\n|2024.08|🔥[零延迟QKV压缩] 在LLM推理中缓解KV缓存和网络瓶颈的零延迟QKV压缩(@University of Virginia)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.04107)|⚠️|⭐️⭐️ |\n|2024.09|🔥[**AlignedKV**] AlignedKV：通过精度对齐量化降低KV缓存的内存访问(@Tsinghua University)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.16546)|[[AlignedKV]](https:\u002F\u002Fgithub.com\u002FAlignedQuant\u002FAlignedKV) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlignedQuant\u002FAlignedKV.svg?style=social)|⭐️ |\n|2024.10|🔥[**LayerKV**] 通过逐层KV缓存管理优化大型语言模型服务(@Ant Group)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.00428)|⚠️|⭐️⭐️ |\n|2024.10|🔥[**AdaKV**] Ada-KV：通过自适应预算分配优化KV缓存淘汰，实现高效LLM推理(@USTC)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.11550)|[[AdaKV]](https:\u002F\u002Fgithub.com\u002FFFY0\u002FAdaKV) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFFY0\u002FAdaKV.svg?style=social&label=Star)|⭐️⭐️|\n|2024.11|🔥[**KV缓存重计算**] 通过I\u002FO感知的部分KV缓存重计算实现高效LLM推理(@University of Southern California)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.17089)|⚠️|⭐️⭐️ |\n|2024.12|🔥[**ClusterKV**] ClusterKV：在语义空间中操作LLM的KV缓存以实现可召回的压缩(@sjtu)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.03213)|⚠️|⭐️⭐️ |\n|2024.12|🔥[**DynamicKV**] DynamicKV：面向长上下文LLMs的任务感知自适应KV缓存压缩(@xiabinzhou0625等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.14838)|⚠️|⭐️⭐️ |\n|2025.02|🔥[**DynamicLLaVA**] [ICLR2025] Dynamic-LLaVA：通过动态视觉-语言上下文稀疏化实现高效多模态大型语言模型(@ECNU, Xiaohongshu)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.00876)|[[DynamicLLaVA]](https:\u002F\u002Fgithub.com\u002FOsilly\u002Fdynamic_llava) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOsilly\u002Fdynamic_llava.svg?style=social&label=Star)|⭐️⭐️|\n|2025.02|🔥[**CacheCraft**] Cache-Craft：管理分块缓存以实现高效的检索增强生成(@Adobe Research)|[[pdf]](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2502.15734)|⚠️|⭐️⭐️ |\n|2025.04|🔥[**KV缓存预取**] 通过异步KV缓存预取提升LLM推理吞吐量(@Alibaba)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.06319)|⚠️|⭐️⭐️ |\n|2025.05|🔥[**KVzip**] KVzip：具有上下文重建能力的查询无关KV缓存压缩(@SNU)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23416)|[[KVzip]](https:\u002F\u002Fgithub.com\u002Fsnu-mllab\u002FKVzip) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsnu-mllab\u002FKVzip.svg?style=social&label=Star)|⭐️⭐️|\n|2025.06|🔥🔥[**推理时超规模扩展**] 结合KV缓存压缩实现推理时的超规模扩展(@NVIDIA)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.05345)|⚠️|⭐️⭐️ |\n|2026.03|[**AVP**] 
代理向量协议：通过词汇媒介投影实现跨模型KV缓存传输(@VectorArc)|[[规范]](https:\u002F\u002Fgithub.com\u002FVectorArc\u002Favp-spec)|[[avp-python]](https:\u002F\u002Fgithub.com\u002FVectorArc\u002Favp-python) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVectorArc\u002Favp-python.svg?style=social)|⭐️⭐️ |\n\n### 📖提示\u002F上下文\u002FKV压缩（[©️返回👆🏻](#paperlist)）\n\u003Cdiv id=\"Context-Compression\">\u003C\u002Fdiv>\n\n|日期|标题|论文|代码|推荐|\n|:---:|:---:|:---:|:---:|:---:|\n|2023.04|🔥[**选择性上下文**] 压缩上下文以提升大语言模型的推理效率（萨里大学） | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.06201.pdf)|[Selective-Context](https:\u002F\u002Fgithub.com\u002Fliyucheng09\u002FSelective_Context)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fliyucheng09\u002FSelective_Context.svg?style=social)|⭐️⭐️ |\n|2023.05|[**AutoCompressor**] 适配语言模型以压缩上下文（普林斯顿大学） | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.14788.pdf)|[AutoCompressor](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FAutoCompressors)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fprinceton-nlp\u002FAutoCompressors.svg?style=social)|⭐️ |\n|2023.10|🔥[**LLMLingua**] LLMLingua：通过压缩提示加速大语言模型的推理（微软） | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.05736.pdf)|[LLMLingua](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLMLingua)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FLLMLingua.svg?style=social)|⭐️⭐️ |\n|2023.10|🔥🔥[**LongLLMLingua**] LongLLMLingua：通过提示压缩在长上下文场景下加速并增强大语言模型性能（微软） | [[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06839)|[LLMLingua](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLMLingua)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FLLMLingua.svg?style=social)|⭐️⭐️ |\n|2024.03|🔥[**LLMLingua-2**] LLMLingua-2：用于高效且忠实的任务无关提示压缩的数据蒸馏（微软） | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.12968.pdf)|[LLMLingua系列](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLMLingua)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FLLMLingua.svg?style=social)|⭐️ |\n|2024.08|🔥🔥[**500xCompressor**] 500xCompressor：面向大语言模型的通用提示压缩（剑桥大学） | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.03094) | ⚠️ |⭐️⭐️ |\n|2024.08|🔥🔥[**特征注意力**] 特征注意力：基于低秩空间的注意力机制用于KV缓存压缩（普渡大学） | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.05646) | ⚠️ |⭐️⭐️ |\n|2024.09|🔥🔥[**提示压缩**] 基于上下文感知句子编码的提示压缩，用于快速且改进的大语言模型推理（Alterra AI）| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.01227) | ⚠️ |⭐️⭐️ |\n|2024.09|🔥🔥[**上下文蒸馏**] 高效的大语言模型上下文蒸馏（佐治亚理工学院）| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.01930) | ⚠️ |⭐️⭐️ |\n|2024.09|🔥🔥[**CRITIPREFILL**] CRITIPREFILL：一种基于片段关键性的预填充加速方法，适用于大语言模型（OPPO） | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.12490) | [CritiPrefill](https:\u002F\u002Fgithub.com\u002F66RING\u002FCritiPrefill)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002F66RING\u002FCritiPrefill.svg?style=social)|⭐️ |\n|2024.10|🔥🔥[**KV-COMPRESS**] 基于分页的KV缓存压缩，按注意力头设置可变压缩率（Cloudflare公司）| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.00161) | [vllm-kvcompress](https:\u002F\u002Fgithub.com\u002FIsaacRe\u002Fvllm-kvcompress) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIsaacRe\u002Fvllm-kvcompress.svg?style=social)|⭐️⭐️ |\n|2024.10|🔥🔥[**LORC**] 基于渐进式压缩策略的大语言模型KV缓存低秩压缩（佐治亚理工学院）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.03111)|⚠️ |⭐️⭐️ |\n|2025.11|🔥🔥[**KVTC**] 
KV缓存变换编码，用于大语言模型推理中的紧凑存储（NVIDIA）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2511.01815)|⚠️|⭐️⭐️ |\n\n### 📖长上下文注意力\u002FKV缓存优化（[©️返回👆🏻](#paperlist)）\n\u003Cdiv id=\"Long-Context-Attention-KVCache\">\u003C\u002Fdiv>\n\n|日期|标题|论文|代码|推荐|\n|:---:|:---:|:---:|:---:|:---:|\n|2023.05|🔥🔥[**分块注意力**] 用于大上下文模型的分块并行Transformer(@UC Berkeley)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.19370.pdf) | ⚠️ |⭐️⭐️ |\n|2023.05|🔥[地标注意力] Transformer的随机访问无限上下文长度(@epfl.ch)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.16300.pdf)|[landmark-attention](https:\u002F\u002Fgithub.com\u002Fepfml\u002Flandmark-attention\u002F)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fepfml\u002Flandmark-attention.svg?style=social)|⭐️⭐️ |\n|2023.07|🔥[**LightningAttention-1**] TRANSNORMERLLM：一种更快更好的大型语言模型，采用改进的TRANSNORMER(@OpenNLPLab)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.14995.pdf)|[TransnormerLLM](https:\u002F\u002Fgithub.com\u002FOpenNLPLab\u002FTransnormerLLM)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenNLPLab\u002FTransnormerLLM.svg?style=social)|⭐️⭐️ |\n|2023.07|🔥[**LightningAttention-2**] Lightning Attention-2：处理大型语言模型中无限序列长度的免费午餐(@OpenNLPLab)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.04658.pdf)|[lightning-attention](https:\u002F\u002Fgithub.com\u002FOpenNLPLab\u002Flightning-attention)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenNLPLab\u002Flightning-attention.svg?style=social)|⭐️⭐️ |\n|2023.10|🔥🔥[**环形注意力**] 基于分块Transformer的环形注意力，实现近乎无限的上下文(@UC Berkeley)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.01889.pdf)| [[RingAttention]](https:\u002F\u002Fgithub.com\u002Flhao499\u002FRingAttention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flhao499\u002FRingAttention.svg?style=social)|⭐️⭐️ |\n|2023.11|🔥[**超注意力**] 超注意力：近线性时间内的长上下文注意力(@yale&Google)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.05869.pdf)|[hyper-attn](https:\u002F\u002Fgithub.com\u002Finsuhan\u002Fhyper-attn)  ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Finsuhan\u002Fhyper-attn.svg?style=social)|⭐️⭐️ |\n|2023.11|[**流式注意力**] 一次遍历的流式算法，用于在亚线性空间中近似超长标记的注意力(@Adobe Research等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.14652.pdf)|⚠️ |⭐️ |\n|2023.11|🔥[**提示缓存**] 提示缓存：用于低延迟推理的模块化注意力复用(@耶鲁大学等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.04934.pdf)|⚠️|⭐️⭐️ |\n|2023.11|🔥🔥[**条纹注意力**] 条纹注意力：因果Transformer的更快环形注意力(@MIT等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.09431.pdf) |[[striped_attention]](https:\u002F\u002Fgithub.com\u002Fexists-forall\u002Fstriped_attention\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fexists-forall\u002Fstriped_attention.svg?style=social) |⭐️⭐️ |\n|2024.01|🔥🔥[**KV量化**] KVQuant：通过KV缓存量化实现千万级上下文长度的LLM推理(@UC Berkeley)|[[pdf]](https:\u002F\u002Fbrowse.arxiv.org\u002Fpdf\u002F2401.18079.pdf)|[[KVQuant]](https:\u002F\u002Fgithub.com\u002FSqueezeAILab\u002FKVQuant\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSqueezeAILab\u002FKVQuant.svg?style=social) |⭐️⭐️ |\n|2024.02|🔥[**中继注意力**] 中继注意力：用于高效服务具有长系统提示的大语言模型(@sensetime.com等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.14808.pdf) | ⚠️ |⭐️⭐️ |\n|2024.04|🔥🔥[Infini-attention] 不留任何上下文：使用Infini-attention的高效无限上下文Transformer(@Google) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.07143.pdf) | ⚠️ |⭐️⭐️ |\n|2024.04|🔥🔥[RAGCache] RAGCache：用于检索增强生成的高效知识缓存(@北京大学&字节跳动公司) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.12457.pdf) | ⚠️ 
|⭐️⭐️ |\n|2024.04|🔥🔥[**KCache**] 使用KCache实现高效LLM推理(@Qiaozhi He, Zhihua Wu)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.18057) | ⚠️ |⭐️⭐️ |\n|2024.04|🔥[**HOMER**] 层次化上下文融合：提升预训练LLM的长上下文理解能力(@KAIST)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.10308)|[[HOMER]](https:\u002F\u002Fgithub.com\u002Falinlab\u002FHOMER) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Falinlab\u002FHOMER?style=social) |⭐️⭐️ |\n|2024.05|🔥🔥[**YOCO**] 你只需缓存一次：用于语言模型的解码器-解码器架构(@微软)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.05254) | [[unilm-YOCO]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm\u002Ftree\u002Fmaster\u002FYOCO) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Funilm.svg?style=social) |⭐️⭐️ |\n|2024.05|🔥🔥[SKVQ] SKVQ：用于大型语言模型的滑动窗口键值缓存量化(@上海人工智能实验室)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.06219) | ⚠️ |⭐️⭐️ |\n|2024.05|🔥🔥[**CLA**] 通过跨层注意力减少Transformer的键值缓存大小(@MIT-IBM)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.12981) | ⚠️ |⭐️⭐️ |\n|2024.06|🔥[LOOK-M] LOOK-M：KV缓存中的“看一次”优化，用于高效多模态长上下文推理(@osu.edu等)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.18139) | [[LOOK-M]](https:\u002F\u002Fgithub.com\u002FSUSTechBruce\u002FLOOK-M) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSUSTechBruce\u002FLOOK-M.svg?style=social) |⭐️⭐️ |\n|2024.06|🔥🔥[**MInference**] MInference 1.0：通过动态稀疏注意力加速长上下文LLM的预填充(@微软等)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.02490) | [[MInference]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMInference) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FMInference.svg?style=social) |⭐️⭐️ |\n|2024.06|🔥🔥[**InfiniGen**] InfiniGen：通过动态KV缓存管理实现大型语言模型的高效生成式推理(@snu)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.19707) | ⚠️ |⭐️⭐️ |\n|2024.06|🔥🔥[**Quest**] Quest：面向高效长上下文LLM推理的查询感知稀疏性(@mit-han-lab等)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.10774)| [[Quest]](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002FQuest) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmit-han-lab\u002FQuest.svg?style=social) |⭐️⭐️ |\n|2024.07|🔥[PQCache] PQCache：基于产品量化技术的KV缓存，用于长上下文LLM推理(@PKU等)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.12820) | ⚠️ |⭐️⭐️ |\n|2024.08|🔥[**SentenceVAE**] SentenceVAE：通过下一句子预测实现更快、更长、更准确的大型语言模型推理(@TeleAI)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.00655) | ⚠️ |⭐️⭐️ |\n|2024.09|🔥[**InstInfer**] InstInfer：为低成本长上下文LLM推理而设计的存储内注意力卸载(@PKU等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.04992) | ⚠️ |⭐️⭐️ |\n|2024.09|🔥[**检索注意力**] 检索注意力：通过向量检索加速长上下文LLM推理(@microsoft.com)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.10516)|⚠️|⭐️⭐️ |\n|2024.10|🔥[**ShadowKV**] ShadowKV：用于高吞吐量长上下文LLM推理的阴影KV缓存(@CMU & 字节跳动)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.21465)|[[ShadowKV]](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FShadowKV) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002FShadowKV.svg?style=social) |⭐️⭐️ |\n|2025.01|🔥🔥🔥 [**闪电注意力**] MiniMax-01：利用闪电注意力扩展基础模型 | [[报告]](https:\u002F\u002Ffilecdn.minimax.chat\u002F_Arxiv_MiniMax_01_Report.pdf) | [[MiniMax-01]](https:\u002F\u002Fgithub.com\u002FMiniMax-AI\u002FMiniMax-01) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMiniMax-AI\u002FMiniMax-01.svg?style=social) | ⭐️⭐️ |\n|2025.06|🔥[**REFORM**] 压缩、聚合与重新计算：改革Transformer中的长上下文处理(@KAIST & Amazon等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.01215)|⚠️|⭐️⭐️ |\n\n### 
📖早退出\u002F中间层解码（[©️返回👆🏻](#paperlist)）\n\u003Cdiv id=\"Early-Exit\">\u003C\u002Fdiv>\n\n|日期|标题|论文|代码|推荐|\n|:---:|:---:|:---:|:---:|:---:|\n|2020.04|[DeeBERT] DeeBERT：用于加速BERT推理的动态早退出(@uwaterloo.ca)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2004.12993.pdf)|⚠️|⭐️ |\n|2020.04|[FastBERT] FastBERT：一种自蒸馏且具有自适应推理时间的BERT(@PKU)|[[pdf]](https:\u002F\u002Faclanthology.org\u002F2020.acl-main.537.pdf)|[[FastBERT]](https:\u002F\u002Fgithub.com\u002Fautoliuweijie\u002FFastBERT) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fautoliuweijie\u002FFastBERT.svg?style=social)|⭐️ |\n|2021.06|[BERxiT] BERxiT：具有更好微调并可扩展至回归任务的BERT早退出(@uwaterloo.ca)|[[pdf]](https:\u002F\u002Faclanthology.org\u002F2021.eacl-main.8.pdf)|[[berxit]](https:\u002F\u002Fgithub.com\u002Fcastorini\u002Fberxit) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcastorini\u002Fberxit.svg?style=social)|⭐️ |\n|2023.06|🔥[**SkipDecode**] SkipDecode：基于批处理与缓存的自回归跳过解码，用于高效LLM推理(@Microsoft) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.02628) |⚠️|⭐️ |\n|2023.10|🔥[**LITE**] 通过LITE指令微调实现中间层解码，加速LLaMA推理(@Arizona State University) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.18581v2.pdf)|⚠️|⭐️⭐️ |\n|2023.12|🔥🔥[**EE-LLM**] EE-LLM：采用3D并行技术的大规模早退出语言模型训练与推理(@alibaba-inc.com) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.04916.pdf)| [[EE-LLM]](https:\u002F\u002Fgithub.com\u002Fpan-x-c\u002FEE-LLM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpan-x-c\u002FEE-LLM.svg?style=social) |⭐️⭐️ |\n|2023.10|🔥[**FREE**] 具有同步并行解码功能的快速稳健自回归语言模型早退出框架(@KAIST AI&AWS AI)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.05424.pdf)| [[fast_robust_early_exit]](https:\u002F\u002Fgithub.com\u002Fraymin0223\u002Ffast_robust_early_exit) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fraymin0223\u002Ffast_robust_early_exit.svg?style=social) |⭐️⭐️ |\n|2024.02|🔥[**EE-Tuning**] EE-Tuning：一种经济高效且可扩展的早退出大型语言模型微调方案(@alibaba-inc.com)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.00518)| [[EE-Tuning]](https:\u002F\u002Fgithub.com\u002Fpan-x-c\u002FEE-LLM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpan-x-c\u002FEE-LLM.svg?style=social) |⭐️⭐️ |\n|2024.07| [跳过注意力] 注意力就是一切，但在大型语言模型推理时你并不需要全部注意力(@University College London)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.15516)|⚠️|⭐️⭐️ |\n|2024.08| [**KOALA**] KOALA：通过多层草稿头结合对抗学习增强LLM的推测解码(@Dalian University)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.08146)|⚠️|⭐️⭐️ |\n\n### 📖并行解码\u002F采样 ([©️返回👆🏻](#paperlist))\n\u003Cdiv id=\"Parallel-Decoding-Sampling\">\u003C\u002Fdiv>\n\n|日期|标题|论文|代码|推荐|\n|:---:|:---:|:---:|:---:|:---:|\n|2018.11|🔥[**并行解码**] 针对深度自回归模型的分块并行解码(@伯克利&谷歌)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1811.03115.pdf)|⚠️ |⭐️⭐️ |\n|2023.02|🔥[**推测采样**] 利用推测采样加速大语言模型解码(@DeepMind)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01318.pdf)| ⚠️ |⭐️⭐️ |\n|2023.05|🔥[**推测采样**] 通过推测解码实现Transformer的快速推理(@Google Research等) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.17192.pdf)| [[LLMSpeculativeSampling]](https:\u002F\u002Fgithub.com\u002Ffeifeibear\u002FLLMSpeculativeSampling) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffeifeibear\u002FLLMSpeculativeSampling.svg?style=social) |⭐️⭐️ |\n|2023.09|🔥[**Medusa**] Medusa：利用多解码头加速LLM生成的简单框架(@Tianle Cai等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.10774.pdf)|[[Medusa]](https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FMedusa) 
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFasterDecoding\u002FMedusa.svg?style=social)|⭐️⭐️ |\n|2023.10|[**OSD**] 在线推测解码(@UC Berkeley等) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.07177.pdf)| ⚠️ |⭐️⭐️|\n|2023.12|[**级联推测**] 用于更快速LLM推理的级联推测草稿(@illinois.edu) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.11462.pdf)| ⚠️ |⭐️|\n|2024.02|🔥[LookaheadDecoding] 使用LOOKAHEAD DECODING打破LLM推理的顺序依赖(@UCSD&Google&UC Berkeley)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.02057.pdf)| [[LookaheadDecoding]](https:\u002F\u002Fgithub.com\u002Fhao-ai-lab\u002FLookaheadDecoding) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhao-ai-lab\u002FLookaheadDecoding.svg?style=social) |⭐️⭐️ |\n|2024.02|🔥🔥[**推测解码**] 解码推测解码(@cs.wisc.edu)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.01528.pdf)| [Decoding Speculative Decoding](https:\u002F\u002Fgithub.com\u002Fuw-mad-dash\u002Fdecoding-speculative-decoding) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fuw-mad-dash\u002Fdecoding-speculative-decoding.svg?style=social) |⭐️|\n|2024.04|🔥🔥[**TriForce**] TriForce：利用层次化推测解码实现长序列生成的无损加速(@cmu.edu&Meta AI)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.11912) | [[TriForce]](https:\u002F\u002Fgithub.com\u002FInfini-AI-Lab\u002FTriForce) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInfini-AI-Lab\u002FTriForce.svg?style=social)|⭐️⭐️ |\n|2024.04|🔥🔥[**隐式迁移**] 通过隐式迁移实现无损大语言模型加速的并行解码(@pku.edu.cn等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.12022.pdf)| ⚠️ |⭐️|\n|2024.05|🔥[指令式解码] 指令微调的大语言模型能够从噪声指令中自我精炼(@KAIST AI)|[[pdf]](https:\u002F\u002Fopenreview.net\u002Fpdf?id=LebzzClHYw)| [[Instructive-Decoding]](https:\u002F\u002Fgithub.com\u002Fjoonkeekim\u002FInstructive-Decoding) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjoonkeekim\u002FInstructive-Decoding.svg?style=social)|⭐️ |\n|2024.05|🔥[S3D] S3D：一种简单且经济高效的低显存GPU自推测解码方案(@lge.com)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.20314)| ⚠️ |⭐️|\n|2024.06|🔥[**并行解码**] 探索与改进分块并行解码中的草稿(@KAIST&Google Research)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.09221)|⚠️ |⭐️⭐️ |\n|2024.07|🔥[多Token推测解码] 多Token联合推测解码以加速大语言模型推理(@加州大学等)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.09221)|⚠️ |⭐️⭐️ |\n|2024.08|🔥[Token回收] 化腐朽为神奇：通过Token回收加速大语言模型推理(@ir.hit.edu.cn等) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.08696)|⚠️ |⭐️⭐️ |\n|2024.08|🔥[**推测解码**] 具有自适应草稿长度的并行推测解码(@USTC等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.11850)|[[PEARL]](https:\u002F\u002Fgithub.com\u002Fsmart-lty\u002FParallelSpeculativeDecoding) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsmart-lty\u002FParallelSpeculativeDecoding.svg?style=social) |⭐️⭐️ |\n|2024.08|🔥[**FocusLLM**] FocusLLM：通过并行解码扩展LLM上下文(@清华大学等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.11745)|[[FocusLLM]](https:\u002F\u002Fgithub.com\u002Fleezythu\u002FFocusLLM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fleezythu\u002FFocusLLM.svg?style=social)|⭐️ |\n|2024.08|🔥[**MagicDec**] MagicDec：利用推测解码突破长上下文生成的延迟-吞吐量权衡(@CMU等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.11049)|[[MagicDec]](https:\u002F\u002Fgithub.com\u002FInfini-AI-Lab\u002FMagicDec\u002F) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInfini-AI-Lab\u002FMagicDec.svg?style=social)|⭐️ |\n|2024.08|🔥[**推测解码**] 通过特征采样和部分对齐蒸馏提升无损推测解码性能(@BIT) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.15562) | ⚠️ |⭐️⭐️ |\n|2024.09|🔥[**混合推理**] 
LLM的高效混合推理：基于奖励的Token建模与选择性云端辅助|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.13757)| ⚠️ |⭐️⭐️ |\n|2024.10|🔥[**PARALLELSPEC**] PARALLELSPEC：用于高效推测解码的并行草稿生成器(@腾讯AI实验室等)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.05589)| ⚠️ |⭐️⭐️ |\n|2024.10|🔥[**Fast Best-of-N**] 通过推测拒绝实现快速Best-of-N解码(@CMU等) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.20290)| ⚠️ |⭐️⭐️ |\n|2025.06|🔥[**Mamba Drafters**] Mamba Drafters用于推测解码(@KAIST & Amazon等) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.01206) | ⚠️ |⭐️⭐️ |\n|2025.06|🔥[**STAND**] 无需模型的推测采样加速测试时缩放(@KAIST & Amazon等) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04708) | ⚠️ |⭐️⭐️ |\n\n### 📖结构化剪枝\u002FKD\u002F权重稀疏化 ([©️返回👆🏻](#paperlist))\n\u003Cdiv id=\"Structured_Pruning_KD_Weight_Sparse\">\u003C\u002Fdiv>\n\n|日期|标题|论文|代码|推荐|\n|:---:|:---:|:---:|:---:|:---:|\n|2023.12|[**FLAP**] 基于波动的大型语言模型自适应结构化剪枝(@中国科学院等)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.11983.pdf)| [[FLAP]](https:\u002F\u002Fgithub.com\u002FCASIA-IVA-Lab\u002FFLAP) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCASIA-IVA-Lab\u002FFLAP.svg?style=social)|⭐️⭐️ |\n|2023.12|🔥[**LASER**] 真相就在其中：通过层选择性秩降低提升语言模型推理能力(@mit.edu)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.13558.pdf)| [[laser]](https:\u002F\u002Fgithub.com\u002Fpratyushasharma\u002Flaser) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpratyushasharma\u002Flaser.svg?style=social)|⭐️⭐️ |\n|2023.12|[PowerInfer] PowerInfer：使用消费级GPU实现快速的大规模语言模型服务(@SJTU)|[[pdf]](https:\u002F\u002Fipads.se.sjtu.edu.cn\u002F_media\u002Fpublications\u002Fpowerinfer-20231219.pdf)|[[PowerInfer]](https:\u002F\u002Fgithub.com\u002FSJTU-IPADS\u002FPowerInfer) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSJTU-IPADS\u002FPowerInfer.svg?style=social)|⭐️ |\n|2024.01|[**Admm Pruning**] 针对剪枝后大型语言模型的快速且最优权重更新(@fmph.uniba.sk)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.02938.pdf)|[[admm-pruning]](https:\u002F\u002Fgithub.com\u002Ffmfi-compbio\u002Fadmm-pruning) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffmfi-compbio\u002Fadmm-pruning.svg?style=social)|⭐️ |\n|2024.01|[FFSplit] FFSplit：为优化语言模型推理中的准确率-效率权衡而拆分前馈网络(@莱斯大学等) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.04044.pdf) |  ⚠️ |⭐️|\n|2025.03|🔥[**Simba**] 稀疏化的状态空间模型是高效的高速公路网络(@KAIST)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.20698)|[[Simba]](https:\u002F\u002Fgithub.com\u002Fwoominsong\u002FSimba) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwoominsong\u002FSimba.svg?style=social)|⭐️ |\n|2025.06|[SDMPrune] SDMPrune：用于高效大型语言模型的自蒸馏MLP剪枝(@CSU)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.11120) |[[SDMPrune]](https:\u002F\u002Fgithub.com\u002Fvisresearch\u002FSDMPrune)![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvisresearch\u002FSDMPrune.svg?style=social&label=Star)|⭐️⭐️ |\n|2026.03|[**HFPrune**] 高保真度的大型语言模型剪枝(@CSU)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.08083) |[[HFPrune]](https:\u002F\u002Fgithub.com\u002Fvisresearch\u002FHFPrune)![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvisresearch\u002FHFPrune.svg?style=social&label=Star)|⭐️⭐️ |\n\n### 📖专家混合(MoE) LLM 推理 ([©️返回👆🏻](#paperlist))\n\u003Cdiv id=\"Mixture_of_Experts_LLM_Inference\">\u003C\u002Fdiv>\n\n|日期|标题|论文|代码|推荐|\n|:---:|:---:|:---:|:---:|:---:|\n|2022.11|🔥[**WINT8\u002F4**] 谁说大象不能跑？将大规模MoE模型引入云端生产环境(@NVIDIA&Microsoft) 
|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10017.pdf)|[[FasterTransformer]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FFasterTransformer) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FFasterTransformer.svg?style=social)|⭐️⭐️ |\n|2023.12|🔥 [**Mixtral Offloading**] 借助卸载技术实现专家混合语言模型的快速推理(@莫斯科物理技术研究所等)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.17238.pdf)| [[mixtral-offloading]](https:\u002F\u002Fgithub.com\u002Fdvmazur\u002Fmixtral-offloading) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvmazur\u002Fmixtral-offloading.svg?style=social)|⭐️⭐️ |\n|2024.01| [MoE-Mamba] MoE-Mamba：结合专家混合的高效选择性状态空间模型(@uw.edu.pl) |  [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.04081.pdf)| ⚠️ |⭐️|\n|2024.04| [MoE Inference] 向推理最优的专家混合大型语言模型迈进(@UC San Diego等)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.02852.pdf)| ⚠️ |⭐️|\n|2024.05| 🔥🔥🔥[DeepSeek-V2] DeepSeek-V2：一款强大、经济且高效的专家混合语言模型(@DeepSeek-AI)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.04434) | [[DeepSeek-V2]](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-V2) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-V2.svg?style=social)| ⭐️⭐️ |\n|2024.06| [MoE] 关于专家混合的综述(@HKU) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.06204)| ⚠️ |⭐️|\n\n### 📖CPU\u002F单GPU\u002FFPGA\u002FNPU\u002F移动端推理 ([©️返回👆🏻](#paperlist))\n\u003Cdiv id=\"CPU-Single-GPU-Inference\">\u003C\u002Fdiv>\n\n|日期|标题|论文|代码|推荐|\n|:---:|:---:|:---:|:---:|:---:|\n|2023.03|[FlexGen] 单GPU高效生成式大模型推理(@斯坦福大学等) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.06865.pdf)|[[FlexGen]](https:\u002F\u002Fgithub.com\u002FFMInference\u002FFlexGen) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFMInference\u002FFlexGen.svg?style=social)|⭐️ |\n|2023.11|[LLM CPU推理] 在CPU上高效运行LLM(@英特尔)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.00502.pdf)| [[intel-extension-for-transformers]](https:\u002F\u002Fgithub.com\u002Fintel\u002Fintel-extension-for-transformers) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fintel\u002Fintel-extension-for-transformers.svg?style=social) |⭐️ |\n|2023.12|[LinguaLinked] LinguaLinked：面向移动设备的分布式大模型推理系统(@加州大学欧文分校)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.00388.pdf)|⚠️ |⭐️ |\n|2023.12|[OpenVINO] 利用OpenVINO结合推测采样与KV缓存优化进行生成式AI推理(@Haim Barad等) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.04951.pdf)|⚠️|⭐️ |\n|2024.03|[FlightLLM] FlightLLM：基于FPGA完整映射流程的高效大模型推理(@Infinigence-AI) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.03868.pdf)|⚠️|⭐️ |\n|2024.03|[Transformer-Lite] Transformer-Lite：在手机GPU上高效部署大模型(@OPPO) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2403\u002F2403.20041.pdf)|⚠️|⭐️ |\n|2024.07|🔥🔥[**xFasterTransformer**] 在CPU上优化大模型推理性能(@英特尔) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.07304)|[[xFasterTransformer]](https:\u002F\u002Fgithub.com\u002Fintel\u002FxFasterTransformer) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fintel\u002FxFasterTransformer.svg?style=social) |⭐️ |\n|2024.07| [综述] 基础模型在AI加速器上的推理优化(@AWS AI) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.09111)|⚠️|⭐️ |\n|2024.10| 移动平台上的大模型性能基准测试：全面评估(@中山大学) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.03613)|⚠️|⭐️ |\n|2024.10|🔥🔥[**FastAttention**] FastAttention：将FlashAttention2扩展至NPU和低资源GPU以实现高效推理(@华为等)| [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.16663)|⚠️|⭐️ |\n|2024.12|🔥🔥[**NITRO**] 
NITRO：在英特尔®笔记本NPU上进行LLM推理(@康奈尔大学)|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.11053)|[[nitro]](https:\u002F\u002Fgithub.com\u002Fabdelfattah-lab\u002Fnitro) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fabdelfattah-lab\u002Fnitro.svg?style=social)|⭐️ |\n|2025.01|📱[**Off Grid**] 面向iOS和Android的端侧LLM+视觉+图像生成(@alichherawalla)|[[github]](https:\u002F\u002Fgithub.com\u002Falichherawalla\u002Foff-grid-mobile)|[[off-grid-mobile]](https:\u002F\u002Fgithub.com\u002Falichherawalla\u002Foff-grid-mobile) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Falichherawalla\u002Foff-grid-mobile.svg?style=social)|⭐️ |\n|2025.12|🔥[**Grail-V\u002FPSE**] 通过POWER8 vec_perm实现非双射注意力压缩——CPU推理速度提升8.8倍(@Elyan Labs)|[[zenodo]](https:\u002F\u002Fdoi.org\u002F10.5281\u002Fzenodo.14862410)|[[ram-coffers]](https:\u002F\u002Fgithub.com\u002FScottcjn\u002Fram-coffers) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FScottcjn\u002Fram-coffers.svg?style=social)|⭐️ |\n|2025.12|🔥[**llama-cpp-power8**] llama.cpp的POWER8优化：vec_perm非双射压缩、IBM MASS集成、dcbt驻留预取。相比原生版本提速8.8倍(@Scottcjn)|[[github]](https:\u002F\u002Fgithub.com\u002FScottcjn\u002Fllama-cpp-power8)|[[llama-cpp-power8]](https:\u002F\u002Fgithub.com\u002FScottcjn\u002Fllama-cpp-power8) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FScottcjn\u002Fllama-cpp-power8.svg?style=social)|⭐️ |\n|2025.12|🔥[**RAM Coffers**] 面向LLM推理的NUMA感知权重银行。将大脑半球认知功能映射到NUMA拓扑，实现智能路由与选择性预取(@Scottcjn)|[[github]](https:\u002F\u002Fgithub.com\u002FScottcjn\u002Fram-coffers)|[[ram-coffers]](https:\u002F\u002Fgithub.com\u002FScottcjn\u002Fram-coffers) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FScottcjn\u002Fram-coffers.svg?style=social)|⭐️ |\n\n### 📖非Transformer架构 ([©️返回👆🏻](#paperlist))\n\u003Cdiv id=\"Non-Transformer-Architecture\">\u003C\u002Fdiv>\n\n|日期|标题|论文|代码|推荐|\n|:---:|:---:|:---:|:---:|:---:|\n|2023.05|🔥🔥[**RWKV**] RWKV：为Transformer时代重新发明RNN(@Bo Peng等) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.13048.pdf)|[[RWKV-LM]](https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBlinkDL\u002FRWKV-LM.svg?style=social)|⭐️⭐️ |\n|2023.12|🔥🔥[**Mamba**] Mamba：具有选择性状态空间的线性时间序列建模(@卡内基梅隆大学等) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.00752.pdf)|[[mamba]](https:\u002F\u002Fgithub.com\u002Fstate-spaces\u002Fmamba) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fstate-spaces\u002Fmamba.svg?style=social)|⭐️⭐️ |\n|2024.06|🔥🔥[**RWKV-CLIP**] RWKV-CLIP：鲁棒的视觉-语言表征学习者(@DeepGlint等) |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.06973)|[[RWKV-CLIP]](https:\u002F\u002Fgithub.com\u002Fdeepglint\u002FRWKV-CLIP) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepglint\u002FRWKV-CLIP.svg?style=social)|⭐️⭐️ |\n|2024.08|🔥🔥[Kraken] Kraken：固有并行的Transformer，用于高效的多设备推理(@普林斯顿大学) | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.07802)|⚠️|⭐️ |\n|2024.08|🔥🔥[**FLA**] FLA：基于Triton的线性注意力机制硬件高效实现库(@sustcsonglin)| [[docs]](https:\u002F\u002Fgithub.com\u002Fsustcsonglin\u002Fflash-linear-attention) |[[flash-linear-attention]](https:\u002F\u002Fgithub.com\u002Fsustcsonglin\u002Fflash-linear-attention) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsustcsonglin\u002Fflash-linear-attention.svg?style=social)|⭐️⭐️ |\n\n### 📖GEMM\u002F张量核心\u002FMMA\u002F并行（[©️返回👆🏻](#paperlist)）\n\u003Cdiv 
id=\"GEMM-Tensor-Cores-WMMA\">\u003C\u002Fdiv>\n\n|日期|标题|论文|代码|推荐|\n|:---:|:---:|:---:|:---:|:---:|\n|2018.03|🔥🔥[张量核心] NVIDIA 张量核心的可编程性、性能与精度（@KTH皇家理工学院等） |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1803.04014.pdf)|⚠️|⭐️ |\n|2021.05|🔥[SM内并行] 通过持久化与弹性块挖掘 GPU 中 SM 内并行性（@上海交通大学）|[[pdf]](https:\u002F\u002Fmivenhan.github.io\u002Fpublication\u002F2021plasticine\u002F2021plasticine.pdf)|⚠️|⭐️ |\n|2022.06|[微基准测试] 利用微基准测试剖析张量核心：延迟、吞吐量与数值行为（@荷兰特文特大学等） |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02874.pdf)|[[DissectingTensorCores]](https:\u002F\u002Fgithub.com\u002Fsunlex0717\u002FDissectingTensorCores) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsunlex0717\u002FDissectingTensorCores.svg?style=social)|⭐️ |\n|2022.09|🔥🔥[FP8] 用于深度学习的 FP8 格式（@NVIDIA） |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05433.pdf)|⚠️|⭐️ |\n|2023.08|🔥[张量核心] 减少共享内存占用以充分利用张量核心的高吞吐量，及其灵活的 API 扩展库（@东京工业大学等） |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.15152.pdf)|[[wmma_extension]](https:\u002F\u002Fgithub.com\u002Fwmmae\u002Fwmma_extension) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwmmae\u002Fwmma_extension.svg?style=social)|⭐️ |\n|2023.03|🔥🔥[**cutlass\u002Fcute**] Graphene：面向 GPU 上优化张量计算的中间表示语言（@NVIDIA）|[[pdf]](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fpdf\u002F10.1145\u002F3582016.3582018)|[[cutlass]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002Fcutlass.svg?style=social)|⭐️ |\n|2024.02|[QUICK] QUICK：面向高效 LLM 推理的量化感知交错与无冲突核函数（@SqueezeBits 公司）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.10076.pdf)|[[QUICK]](https:\u002F\u002Fgithub.com\u002FSqueezeBits\u002FQUICK) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSqueezeBits\u002FQUICK.svg?style=social)|⭐️⭐️ |\n|2024.02|[张量并行] TP-AWARE DEQUANTIZATION（@IBM T.J. 
沃森研究中心）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.04925.pdf)|⚠️|⭐️ |\n|2024.07|🔥🔥[**flute**] 面向查表量化 LLM 的快速矩阵乘法（@麻省理工学院等） | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.10960)|[[flute]](https:\u002F\u002Fgithub.com\u002FHanGuo97\u002Fflute) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHanGuo97\u002Fflute.svg?style=social)|⭐️⭐️ |\n|2024.08|🔥🔥[**LUT TENSOR CORE**] LUT 张量核心：查找表实现高效低比特 LLM 推理加速（@上海交通大学&北京大学等）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.06003)|⚠️|⭐️ |\n|2024.08|🔥🔥[**MARLIN**] MARLIN：大型语言模型上的混合精度自回归并行推理（@ISTA） | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.11743)|[[marlin]](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fmarlin) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIST-DASLab\u002Fmarlin.svg?style=social)|⭐️⭐️ |\n|2024.08|🔥🔥[**SpMM**] 使用张量核心进行高性能非结构化稀疏矩阵-矩阵乘法计算（@苏黎世联邦理工学院）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.11551)|⚠️|⭐️ |\n|2024.09|🔥🔥[**TEE**] 基于 nVIDIA H100 GPU 的机密计算：性能基准研究（@phala.network）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.03992)|⚠️|⭐️ |\n|2024.09|🔥🔥[**HiFloat8**] 华为 Ascend HiFloat8 格式用于深度学习（@华为）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.16626)|⚠️|⭐️ |\n|2024.09|🔥🔥[**张量核心**] 面向 GPU 张量核心的大型语言模型任意精度高效加速（@南京大学）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.17870)|⚠️|⭐️ |\n|2024.07|🔥🔥[**张量积**] 利用张量核心加速张量积运算（@海德堡大学）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.09621)|⚠️|⭐️ |\n|2024.12| 🔥🔥[**HADACORE**] HADACORE：张量核心加速的哈达玛变换核函数（@Meta）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.09621)|[[hadamard_transform]](https:\u002F\u002Fgithub.com\u002Fpytorch-labs\u002Fapplied-ai\u002Ftree\u002Fmain\u002Fkernels\u002Fcuda\u002Finference\u002Fhadamard_transform) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpytorch-labs\u002Fapplied-ai.svg?style=social)|⭐️ |\n|2024.10| 🔥🔥[**FLASH-ATTENTION RNG**] 通过将随机数生成器隐藏在 GEMM 中来降低 Flash-Attention 中 Dropout 的开销（@普林斯顿大学）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.07531)|⚠️|⭐️ |\n|2025.02| 🔥🔥[**TRITONBENCH**] TRITONBENCH：用于生成 Triton 运算符的大语言模型能力基准测试（@thunlp） | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.14752) | [[TritonBench]](https:\u002F\u002Fgithub.com\u002Fthunlp\u002FTritonBench) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthunlp\u002FTritonBench.svg?style=social)|⭐️⭐️ |\n|2025.04| 🔥🔥[**Triton-distributed**] TileLink：利用以瓦片为中心的原语生成高效的计算-通信重叠核函数（@字节跳动-Seed） | [[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.20313) | [[Triton-distributed]](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FTriton-distributed) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FByteDance-Seed\u002FTriton-distributed.svg?style=social)|⭐️⭐️ |\n\n### 📖VLM\u002F位置嵌入\u002F其他（[©️返回👆🏻](#paperlist)）\n\u003Cdiv id=\"Others\">\u003C\u002Fdiv>\n\n|日期|标题|论文|代码|推荐|\n|:---:|:---:|:---:|:---:|:---:|\n|2021.04|🔥[RoPE] ROFORMER：带有旋转位置嵌入的增强型 Transformer（@追一科技有限公司） |[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2104.09864.pdf)|[[transformers]](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmodel_doc\u002Froformer) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhuggingface\u002Ftransformers.svg?style=social)|⭐️ |\n|2022.10|[ByteTransformer] 针对变长输入优化的高性能 Transformer（@字节跳动&NVIDIA）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03052.pdf)|[[ByteTransformer]](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FByteTransformer) 
![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002FByteTransformer.svg?style=social)|⭐️ |\n|2024.09|🔥[**Inf-MLLM**] Inf-MLLM：单 GPU 上多模态大语言模型的高效流式推理（@sjtu）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.09086)|⚠️|⭐️ |\n|2024.11|🔥[VL-CACHE] VL-CACHE：面向视觉-语言模型推理加速的稀疏性和模态感知 KV 缓存压缩（@加州大学洛杉矶分校等）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.23317)|⚠️|⭐️ |\n|2025.02| 🔥[**DynamicLLaVA**] [ICLR2025] Dynamic-LLaVA：通过动态视觉-语言上下文稀疏化实现高效多模态大语言模型（@华东师范大学、小红书）|[[pdf]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.00876)|[[DynamicLLaVA]](https:\u002F\u002Fgithub.com\u002FOsilly\u002Fdynamic_llava) ![](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOsilly\u002Fdynamic_llava.svg?style=social&label=Star)|⭐️⭐️|\n\n## ©️许可证\n\nGNU 通用公共许可证 v3.0\n\n## 🎉贡献\n\n欢迎给这个仓库点个赞并提交 Pull Request！\n\n\u003Cdiv align='center'>\n  \u003Cimg width=\"450\" height=\"250\" alt=\"v02\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fxlite-dev_Awesome-LLM-Inference_readme_e21369547821.png\">\n\u003Ca href=\"https:\u002F\u002Fstar-history.com\u002F#xlite-dev\u002FAwesome-LLM-Inference&Date\">\n  \u003Cpicture align='center'>\n    \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fxlite-dev_Awesome-LLM-Inference_readme_076fb9e294ec.png&theme=dark\" \u002F>\n    \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fxlite-dev_Awesome-LLM-Inference_readme_076fb9e294ec.png\" \u002F>\n    \u003Cimg width=\"350\" height=\"250\" alt=\"Star历史图表\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fxlite-dev_Awesome-LLM-Inference_readme_076fb9e294ec.png\" \u002F>\n  \u003C\u002Fpicture>\n\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n---\n\n### Elyan Labs生态系统的一部分\n\n- [BoTTube](https:\u002F\u002Fbottube.ai) — AI视频平台，拥有119+个智能体共同创作内容\n- [RustChain](https:\u002F\u002Frustchain.org) — 基于硬件证明的古老性证明区块链\n- [GitHub](https:\u002F\u002Fgithub.com\u002FScottcjn)","# Awesome-LLM-Inference 快速上手指南\n\nAwesome-LLM-Inference 是一个精选的大语言模型（LLM）推理优化论文与代码资源库。它并非一个单一的推理引擎，而是一个汇集了最新推理技术（如量化、注意力机制优化、显存管理、并行策略等）的索引列表，旨在帮助开发者快速定位并复现高效的 LLM 推理方案。\n\n## 环境准备\n\n本项目主要作为资源索引和文档集合，核心依赖为 Python 环境用于运行下载脚本。若要复现列表中具体的论文代码，需根据对应项目的要求配置特定的深度学习环境（通常涉及 PyTorch, CUDA, Triton 等）。\n\n*   **操作系统**: Linux (推荐 Ubuntu 20.04+), macOS, Windows\n*   **Python 版本**: Python 3.8 或更高版本\n*   **前置依赖**:\n    *   Git (用于克隆仓库)\n    *   Python3 & pip\n    *   (可选) CUDA Toolkit (若需编译或运行特定 GPU 加速代码)\n\n## 安装步骤\n\n### 1. 克隆仓库\n使用 Git 将项目克隆到本地：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference.git\ncd Awesome-LLM-Inference\n```\n\n> **国内加速建议**：如果访问 GitHub 速度较慢，可使用国内镜像源（如 Gitee 镜像，若有）或通过代理加速克隆过程。\n\n### 2. 下载论文合集 (可选)\n项目提供了一键下载所有收录论文 PDF 的脚本。确保已安装 Python3，然后运行：\n\n```bash\npython3 download_pdfs.py\n```\n\n*注：该脚本由 Doubao AI 生成，用于批量获取资源库中提到的技术论文。*\n\n## 基本使用\n\n由于本项目是资源列表，\"使用\"的核心在于查阅目录并跳转至具体技术的实现仓库。\n\n### 1. 浏览分类资源\n打开本地的 `README.md` 文件或访问 GitHub 页面，根据需求查看以下核心分类：\n\n*   **热门趋势 (Trending Topics)**: 关注最新的 Sora, DeepSeek-V2\u002FV3\u002FR1, FlashAttention-3, Mooncake 等前沿技术。\n*   **架构优化**: 查找 MLA (Multi-head Latent Attention), MoE (Mixture-of-Experts) 等相关实现。\n*   **推理加速**: 检索连续批处理 (Continuous Batching), KV Cache 优化，量化 (Quantization) 方案。\n*   **并行策略**: 了解多卡\u002F多机并行 (DP, TP, PP, SP) 的最新进展。\n\n### 2. 获取具体代码示例\n在表格中找到感兴趣的技术（例如 `FlashAttention-3` 或 `DeepSeek-V3`），点击 **Code** 列对应的链接进入官方实现仓库。\n\n**示例：尝试复现 FlashAttention-3**\n1.  在列表中找到 `FlashAttention-3` 条目。\n2.  
点击代码链接进入 [Dao-AILab\u002Fflash-attention](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention)。\n3.  按照该子项目的 README 进行安装和使用：\n\n```bash\n# 以下为子项目典型安装命令示例，具体请以子项目文档为准\npip install flash-attn --no-build-isolation\n```\n\n### 3. 利用综述文档学习\n项目发布过一份名为 `Awesome LLM Inference for Beginners.pdf` 的入门综述（可在 Releases 页面下载），涵盖了 FastServe, PagedAttention, SmoothQuant 等基础概念的详细解读，适合初学者系统学习。","某初创团队正在开发一款基于大模型的实时法律问答助手，需要在有限的单卡 GPU 服务器上实现高并发、低延迟的推理服务。\n\n### 没有 Awesome-LLM-Inference 时\n- **技术选型迷茫**：面对 FlashAttention、PagedAttention、WINT8 等数十种优化论文，团队花费数周手动筛选，难以判断哪些代码已开源且适合当前硬件。\n- **性能瓶颈难破**：自行实现的注意力机制显存占用过高，导致批量处理（Batching）能力极弱，用户稍多即出现显存溢出（OOM）。\n- **量化落地困难**：尝试引入 INT4 量化时，因缺乏成熟的 WINT4\u002FSmoothQuant 参考实现，模型精度严重下降，无法满足法律场景的准确性要求。\n- **并行策略缺失**：面对长上下文请求，不懂如何应用 Ring Attention 或 USP 混合并行策略，导致首字延迟高达数秒，用户体验极差。\n\n### 使用 Awesome-LLM-Inference 后\n- **精准快速落地**：直接查阅分类清单，迅速定位到带有完整代码的 PagedAttention 和 Continuous Batching 方案，将技术验证周期从数周缩短至两天。\n- **显存效率倍增**：集成清单推荐的 KV Cache 优化算法，显存利用率提升 3 倍，成功支持更大 Batch Size，并发吞吐量显著增加。\n- **无损量化部署**：复用 AWQ 和 WINT8\u002F4 的成熟实现，在模型体积压缩 75% 的同时，保持了法律条文引用的准确度，顺利在单卡运行。\n- **长文本流畅响应**：应用 Long Context 优化策略，结合早期退出（Early-Exit）技术，将长文档分析的首字延迟降低至毫秒级，响应丝滑流畅。\n\nAwesome-LLM-Inference 充当了大模型推理优化的“导航图”，让团队免于重复造轮子，直接站在前沿算法的肩膀上构建高性能应用。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fxlite-dev_Awesome-LLM-Inference_de7e0850.png","xlite-dev","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fxlite-dev_1feaae80.png","Develop ML\u002FAI toolkits and ML\u002FAI\u002FCUDA Learning resources.",null,"https:\u002F\u002Fgithub.com\u002Fxlite-dev",[82],{"name":83,"color":84,"percentage":85},"Python","#3572A5",100,5124,358,"2026-04-04T16:16:18","GPL-3.0","未说明","未说明 (项目为论文与代码合集，具体需求取决于列表中各子项目的实现，通常涉及 NVIDIA GPU 及 CUDA)",{"notes":93,"python":94,"dependencies":95},"该项目是一个精选列表（Awesome List），汇集了 LLM 推理相关的论文和对应代码仓库，本身不是一个可直接运行的单一推理框架。运行具体工具需参考列表中各子项目（如 DeepSpeed, vLLM, FlashAttention 等）的独立文档。文中提到的下载脚本由 Doubao AI 生成。","3.x (文中下载脚本示例使用 python3)",[90],[13,26],[98,99,100,101,102,103,104,105,106,107,108,109,110,111],"flash-attention","paged-attention","tensorrt-llm","vllm","awesome-llm","llm-inference","deepseek","flash-attention-3","deepseek-v3","minimax-01","deepseek-r1","mla","flash-mla","qwen3","2026-03-27T02:49:30.150509","2026-04-06T06:44:04.497058",[115,120,125,130,135,140,145],{"id":116,"question_zh":117,"answer_zh":118,"source_url":119},10349,"如何向该仓库提交新的论文、代码链接或资源？","欢迎直接提交 Pull Request (PR) 将最新的论文、博客或开源代码添加到仓库中。维护者会审查并合并您的 PR。例如，您可以将新论文添加到阅读列表，或将开源项目的代码链接添加到相应章节（如'LLM Train\u002FInference Framework'）。","https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fissues\u002F45",{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},10350,"是否可以将最近的论文或更新排在列表顶部以便追踪？","用户曾建议反转出版物顺序，将最近的放在顶部以便追踪。虽然该特定 Issue 因超时关闭，但这反映了用户对按时间倒序排列内容的需求。建议用户在浏览时关注仓库的最近更新记录或通过 PR 协助整理最新内容。","https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fissues\u002F143",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},10351,"是否有推荐的上下文压缩（Context Compression）方法或论文？","社区推荐了多种上下文压缩方法，包括 LLMLingua、Selective-Context 和 Auto-Compressor。这些方法已被讨论并合并到仓库的阅读列表中，用于优化长上下文处理。","https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fissues\u002F4",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},10352,"有哪些关于 KV Cache 压缩或量化的最新论文推荐？","推荐的论文包括《No Token Left Behind》和《Get More with Less》，这两篇论文探讨了 KV Cache 的压缩与量化技术，旨在提高推理效率。这些论文已通过 PR 
合并到仓库中。","https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fissues\u002F5",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},10353,"FlashInfer 是否值得加入 LLM 推理框架列表？","是的，FlashInfer 是一个值得关注的推理加速项目。维护者建议用户通过提交 PR 将其添加到'LLM Train\u002FInference Framework'部分，经过审查后即可合并。","https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fissues\u002F23",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},10354,"是否有官方交流群或社区讨论组？","目前由于时间和精力有限，维护者暂未建立官方的交流群（如微信群）。不过，社区成员可以自发组织非官方群组进行讨论。维护者更鼓励大家通过提交 PR 来分享和讨论 LLM 推理的最新进展。","https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fissues\u002F17",{"id":146,"question_zh":147,"answer_zh":148,"source_url":149},10355,"除了论文列表，是否有关于顶尖研究实验室的资源汇总？","有用户建议添加来自大公司或大学的活跃实验室列表（如 MIT HAN Lab, DAO Lab, IST-DASLab），以便用户通过实验室博客获取最新动态。维护者欢迎用户通过 PR 形式补充这些优秀的实验室博客资源。","https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fissues\u002F134",[151,156,161,166,171,176,181,186,191,196,201,206,211,216,221,226,231,236,241,246],{"id":152,"version":153,"summary_zh":154,"released_at":155},107569,"v2.6.20","## What's Changed\r\n* Add 4 papers by @woominsong in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F148\r\n* Update new paper (KVzip) by @Janghyun1230 in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F149\r\n* Add a new paper (GuidedQuant) by @jusjinuk in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F150\r\n* Add STAND by @woominsong in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F151\r\n* Add Inference-Time Hyper-Scaling by @CStanKonrad in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F152\r\n\r\n## New Contributors\r\n* @woominsong made their first contribution in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F148\r\n* @jusjinuk made their first contribution in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F150\r\n* @CStanKonrad made their first contribution in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F152\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fcompare\u002Fv2.6.19...v2.6.20","2025-06-17T09:57:32",{"id":157,"version":158,"summary_zh":159,"released_at":160},107570,"v2.6.19","## What's Changed\r\n* 🔥[SageAttention-3] SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-bit Training by @DefTruth in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F147\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fcompare\u002Fv2.6.18...v2.6.19","2025-05-27T05:52:33",{"id":162,"version":163,"summary_zh":164,"released_at":165},107571,"v2.6.18","## What's Changed\r\n* Flex Attention: a Programming Model for Generating Optimized Attention Kernels by @DefTruth in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F146\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fcompare\u002Fv2.6.17...v2.6.18","2025-05-15T06:04:43",{"id":167,"version":168,"summary_zh":169,"released_at":170},107572,"v2.6.17","## What's Changed\r\n* 🔥[BitNet v2] Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs by @DefTruth in 
https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F144\r\n* Add The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs by @PiotrNawrot in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F145\r\n\r\n## New Contributors\r\n* @PiotrNawrot made their first contribution in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F145\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fcompare\u002Fv2.6.16...v2.6.17","2025-05-06T02:25:28",{"id":172,"version":173,"summary_zh":174,"released_at":175},107573,"v2.6.16","## What's Changed\r\n* Add PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters by @Lizonghang in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F137\r\n* 🔥🔥[SGLang] Efficiently Programming Large Language Models using SGLang by @DefTruth in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F138\r\n* 🔥[FSDP 1\u002F2] PyTorch FSDP: Getting Started with Fully Sharded Data Parallel(FSDP) by @DefTruth in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F139\r\n* 🔥[MMInference] MMInference: Accelerating Pre-filling for Long-Context Visual Language Models via Modality-Aware Permutation Sparse Attention by @DefTruth in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F140\r\n* Update Multi-GPUs\u002FMulti-Nodes Parallelism by @DefTruth in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F141\r\n* 🔥[Triton-distributed] TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives by @DefTruth in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F142\r\n\r\n## New Contributors\r\n* @Lizonghang made their first contribution in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F137\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fcompare\u002Fv2.6.15...v2.6.16","2025-04-27T08:33:14",{"id":177,"version":178,"summary_zh":179,"released_at":180},107574,"v2.6.15","## What's Changed\r\n* MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism by @DefTruth in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F131\r\n* TRITONBENCH: Benchmarking Large Language Model Capabilities for Generating Triton Operator by @DefTruth in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F132\r\n* 🔥[KV Cache Prefetch] Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching by @DefTruth in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F133\r\n* Add SeerAttention and SlimAttention Paper by @sunshinemyson in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F135\r\n\r\n## New Contributors\r\n* @sunshinemyson made their first contribution in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F135\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fcompare\u002Fv2.6.14...v2.6.15","2025-04-17T08:08:31",{"id":182,"version":183,"summary_zh":184,"released_at":185},107575,"v2.6.14","## What's Changed\r\n* [feat] add 
deepseek FlashMLA by @shaoyuyoung in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F120\r\n* Add our ICLR2025 work Dynamic-LLaVA by @Blank-z0 in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F121\r\n* 🔥[MHA2MLA] Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs by @DefTruth in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F122\r\n* update the title of SageAttention2 and add SpargeAttn by @jt-zhang in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F123\r\n* Add DeepSeek Open Sources modules by @DefTruth in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F124\r\n* Update DeepSeek\u002FMLA Topics by @DefTruth in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F125\r\n* Request to Add CacheCraft: A Relevant Work on Chunk-Aware KV Cache Reuse by @skejriwal44 in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F126\r\n* 🔥[X-EcoMLA] Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression by @DefTruth in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F127\r\n* Add download_pdfs.py by @DefTruth in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F128\r\n* Update README.md by @DefTruth in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F129\r\n* Update Mooncake-v3 paper link by @DefTruth in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F130\r\n\r\n## New Contributors\r\n* @Blank-z0 made their first contribution in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F121\r\n* @jt-zhang made their first contribution in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F123\r\n* @skejriwal44 made their first contribution in https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fpull\u002F126\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FAwesome-LLM-Inference\u002Fcompare\u002Fv2.6.13...v2.6.14","2025-03-31T04:56:24",{"id":187,"version":188,"summary_zh":189,"released_at":190},107576,"v2.6.13","## What's Changed\r\n* 🔥[DeepSeek-NSA] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F119\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fcompare\u002Fv2.6.12...v2.6.13","2025-02-19T11:46:33",{"id":192,"version":193,"summary_zh":194,"released_at":195},107577,"v2.6.12","## What's Changed\r\n* Add Multi-head Latent Attention(MLA) topic by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F118\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fcompare\u002Fv2.6.11...v2.6.12","2025-02-13T04:21:30",{"id":197,"version":198,"summary_zh":199,"released_at":200},107578,"v2.6.11","## What's Changed\r\n* add `MiniMax-01` in Trending LLM\u002FVLM Topics and Long Context Attention by @shaoyuyoung in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F112\r\n* [feat] add deepseek-r1 by @shaoyuyoung 
in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F113\r\n* 🔥🔥[DistServe] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F114\r\n* 🔥🔥[KVDirect] KVDirect: Distributed Disaggregated LLM Inference by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F115\r\n* 🔥🔥[DeServe] DESERVE: TOWARDS AFFORDABLE OFFLINE LLM INFERENCE VIA DECENTRALIZATION by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F116\r\n* 🔥🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F117\r\n\r\n## New Contributors\r\n* @shaoyuyoung made their first contribution in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F112\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fcompare\u002Fv2.6.10...v2.6.11","2025-01-31T06:54:17",{"id":202,"version":203,"summary_zh":204,"released_at":205},107579,"v2.6.10","## What's Changed\r\n* 🔥🔥🔥[DeepSeek-V3] DeepSeek-V3 Technical Report by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F109\r\n* 🔥🔥[SP: TokenRing] TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F110\r\n* 🔥🔥[FFPA] FFPA: Yet another Faster Flash Prefill Attention with O(1) SRAM complexity for headdim > 256, ~1.5x faster than SDPA EA(@DefTruth) by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F111\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fcompare\u002Fv2.6.9...v2.6.10","2025-01-06T06:12:00",{"id":207,"version":208,"summary_zh":209,"released_at":210},107580,"v2.6.9","## What's Changed\r\n* 🔥🔥[TurboAttention] TURBOATTENTION: EFFICIENT ATTENTION APPROXIMATION FOR HIGH THROUGHPUTS LLMS by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F105\r\n* 🔥🔥[NITRO] NITRO: LLM INFERENCE ON INTEL® LAPTOP NPUS by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F106\r\n* 🔥[DynamicKV] DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F107\r\n* 🔥🔥[HADACORE] HADACORE: TENSOR CORE ACCELERATED HADAMARD TRANSFORM KERNEL by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F108\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fcompare\u002Fv2.6.8...v2.6.9","2024-12-22T08:04:24",{"id":212,"version":213,"summary_zh":214,"released_at":215},107581,"v2.6.8","## What's Changed\r\n* 🔥[ClusterKV] ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F103\r\n* 🔥[BatchLLM] BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching by @DefTruth in 
https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F104\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fcompare\u002Fv2.6.7...v2.6.8","2024-12-09T01:22:36",{"id":217,"version":218,"summary_zh":219,"released_at":220},107582,"v2.6.7","## What's Changed\r\n* 🔥[Star-Attention: 11x~ speedup] Star Attention: Efficient LLM Inference over Long Sequences by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F101\r\n* 🔥[KV Cache Recomputation] Efficient LLM Inference with I\u002FO-Aware Partial KV Cache Recomputation by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F102\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fcompare\u002Fv2.6.6...v2.6.7","2024-12-02T05:30:13",{"id":222,"version":223,"summary_zh":224,"released_at":225},107583,"v2.6.6","## What's Changed\r\n* Add code link to BPT by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F95\r\n* add vAttention code link by @KevinZeng08 in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F96\r\n* 🔥[SageAttention] SAGEATTENTION: ACCURATE 8-BIT ATTENTION FOR PLUG-AND-PLAY INFERENCE ACCELERATION(@thu-ml) by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F97\r\n* 🔥[SageAttention-2] SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration(@thu-ml) by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F98\r\n* 🔥[Squeezed Attention] SQUEEZED ATTENTION: Accelerating Long Context Length LLM Inference(@UC Berkeley) by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F99\r\n* 🔥[SparseInfer] SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F100\r\n\r\n## New Contributors\r\n* @KevinZeng08 made their first contribution in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F96\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fcompare\u002Fv2.6.5...v2.6.6","2024-11-25T03:22:59",{"id":227,"version":228,"summary_zh":229,"released_at":230},107584,"v2.6.5","## What's Changed\r\n* Add DP\u002FTP\u002FSP\u002FCP papers with codes by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F92\r\n* 🔥🔥[SP: BPT] Blockwise Parallel Transformer for Large Context Models by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F93\r\n* 🔥🔥[TP: Comm Compression] Communication Compression for Tensor Parallel LLM Inference by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F94\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fcompare\u002Fv2.6.4...v2.6.5","2024-11-18T02:53:11",{"id":232,"version":233,"summary_zh":234,"released_at":235},107585,"v2.6.4","## What's Changed\r\n* 🔥[BitNet] BitNet a4.8: 4-bit Activations for 1-bit LLMs by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F91\r\n\r\n\r\n**Full Changelog**: 
https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fcompare\u002Fv2.6.3...v2.6.4","2024-11-13T07:02:13",{"id":237,"version":238,"summary_zh":239,"released_at":240},107586,"v2.6.3","## What's Changed\r\n* 🔥[Fast Best-of-N] Fast Best-of-N Decoding via Speculative Rejection by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F89\r\n* 🔥[Tensor Product] Acceleration of Tensor-Product Operations with Tensor Cores by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F90\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fcompare\u002Fv2.6.2...v2.6.3","2024-11-01T01:18:27",{"id":242,"version":243,"summary_zh":244,"released_at":245},107587,"v2.6.2","## What's Changed\r\n* early exit of LLM inference by @boyi-liu in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F85\r\n* Add paper AdaKV by @FFY0 in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F86\r\n* Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance by @aharshms in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F87\r\n* 🔥[FastAttention] FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs for Efficient Inference by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F88\r\n\r\n## New Contributors\r\n* @boyi-liu made their first contribution in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F85\r\n* @FFY0 made their first contribution in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F86\r\n* @aharshms made their first contribution in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F87\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fcompare\u002Fv2.6.1...v2.6.2","2024-10-28T02:38:17",{"id":247,"version":248,"summary_zh":249,"released_at":250},107588,"v2.6.1","## What's Changed\r\n* [From Author] Link CacheGen and CacheBlend to LMCache by @KuntaiDu in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F80\r\n* 🔥[LORC] Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F81\r\n* Large Language Model Performance Benchmarking on Mobile Platforms: A Thorough Evaluation by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F82\r\n* [LLM Inference] LARGE LANGUAGE MODEL INFERENCE ACCELERATION: A COMPREHENSIVE HARDWARE PERSPECTIVE by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F83\r\n* 🔥[PARALLELSPEC] PARALLELSPEC: PARALLEL DRAFTER FOR EFFICIENT SPECULATIVE DECODING by @DefTruth in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F84\r\n\r\n## New Contributors\r\n* @KuntaiDu made their first contribution in https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fpull\u002F80\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FDefTruth\u002FAwesome-LLM-Inference\u002Fcompare\u002Fv2.6...v2.6.1","2024-10-14T05:08:04"]
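
The release records above are stored in the page's flat, index-referenced payload format: a record such as `{"id":172,"version":173,"summary_zh":174,"released_at":175}` holds integer indices into one top-level JSON array, and the values those indices point at (the numeric id, the version string like "v2.6.16", the changelog body, and the release timestamp) appear immediately after it. Below is a minimal Python sketch of how such records could be pulled out of the raw payload. It is an illustration under stated assumptions, not a documented API of the site: the file name `_payload.json` is hypothetical, the "ints inside containers are indices" rule is inferred from the structure visible here, and wrapper or special nodes elsewhere in the payload may need extra handling.

```python
import json


def load_payload(path):
    """Parse the prerendered payload file into its flat top-level JSON array."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def _ref(payload, v):
    """Resolve a container element: non-negative ints are treated as indices
    into the flat array (an assumption inferred from the visible structure);
    anything else is kept as a literal value."""
    if isinstance(v, int) and not isinstance(v, bool) and v >= 0:
        return deref(payload, v)
    return v


def deref(payload, index):
    """Recursively resolve the entry at `index`, following integer references."""
    value = payload[index]
    if isinstance(value, dict):
        return {k: _ref(payload, v) for k, v in value.items()}
    if isinstance(value, list):
        return [_ref(payload, v) for v in value]
    return value  # strings, numbers, booleans and null are literal leaves


if __name__ == "__main__":
    payload = load_payload("_payload.json")  # hypothetical file name
    # A release record is recognisable by its key set, as seen in the data above.
    release_keys = {"id", "version", "summary_zh", "released_at"}
    releases = [
        deref(payload, i)
        for i, entry in enumerate(payload)
        if isinstance(entry, dict) and set(entry) == release_keys
    ]
    for release in releases:
        print(release["version"], release["released_at"])
```

Note that `json.load` already decodes escape sequences such as `\u002F` back to `/` and `\r\n` back to real line breaks, so the resolved `summary_zh` fields come out as the readable "## What's Changed" markdown bodies with working PR URLs.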