[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-October2001--Awesome-KV-Cache-Compression":3,"tool-October2001--Awesome-KV-Cache-Compression":64},[4,17,25,39,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":10,"last_commit_at":23,"category_tags":24,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":26,"name":27,"github_repo":28,"description_zh":29,"stars":30,"difficulty_score":10,"last_commit_at":31,"category_tags":32,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[33,34,35,36,14,37,15,13,38],"图像","数据工具","视频","插件","其他","音频",{"id":40,"name":41,"github_repo":42,"description_zh":43,"stars":44,"difficulty_score":45,"last_commit_at":46,"category_tags":47,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[14,33,13,15,37],{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":45,"last_commit_at":54,"category_tags":55,"status":16},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 
# Awesome-KV-Cache-Compression

📰 Must-read papers on KV Cache Compression (constantly updating 🤗).

Awesome-KV-Cache-Compression is a continuously updated knowledge map for KV cache compression, one of the thorniest problems in large-model inference. It gathers the papers, open-source implementations, and surveys in one place, so you can quickly find answers to the question of how to make long-context inference memory-efficient without losing accuracy.

Facing KV caches that routinely run to tens of gigabytes, the list organizes the mainstream approaches (pruning, sparsification, quantization, recomputation) and points to ready-to-run code repositories such as NVIDIA's kvpress and KVCache-Factory, so you don't have to build from scratch.

It suits developers, algorithm engineers, and academics working on LLM deployment, optimization, or research; if you simply want to understand how large models save memory, you will also find accessible surveys here, with one-click links to the PDFs.

The highlights are real-time updates and one-stop coverage: new papers are added as they appear, and every title links straight to the original text, sparing you the hassle of searching around.
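To make "tens of gigabytes" concrete: a standard multi-head transformer caches one key and one value vector per layer, per KV head, per token. A back-of-the-envelope sketch (not from this repository; the shapes are illustrative Llama-2-7B-style values):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Total KV cache size: 2 tensors (K and V) per layer, per KV head, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative 7B-class shapes: 32 layers, 32 KV heads, head_dim 128, fp16 elements.
gb = kv_cache_bytes(32, 32, 128, seq_len=32_768, batch=4) / 2**30
print(f"{gb:.1f} GiB")  # 64.0 GiB for a batch of 4 at 32k context
```

Each category below attacks one factor of this product: pruning and merging shrink `seq_len`, cross-layer sharing shrinks `num_layers`, low-rank and GQA-style methods shrink `num_kv_heads * head_dim`, and quantization shrinks `bytes_per_elem`.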
<div align="center">
<img src="https://oss.gittoolsai.com/images/October2001_Awesome-KV-Cache-Compression_readme_f634855ab72f.png" width=90%>
</div>

<div align="center">

[![LICENSE](https://img.shields.io/github/license/October2001/Awesome-KV-Cache-Compression)](https://github.com/October2001/Awesome-KV-Cache-Compression/blob/main/LICENSE)
[![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)
[![commit](https://img.shields.io/github/last-commit/October2001/Awesome-KV-Cache-Compression?color=blue)](https://github.com/October2001/Awesome-KV-Cache-Compression/commits/main)
[![PR](https://img.shields.io/badge/PRs-Welcome-red)](https://github.com/October2001/Awesome-KV-Cache-Compression/pulls)
[![GitHub Repo stars](https://img.shields.io/github/stars/October2001/Awesome-KV-Cache-Compression)](https://github.com/October2001/Awesome-KV-Cache-Compression)

</div>

## 📢 News
🎉 [2024-07-23] Project Beginning 🥳

## 📜 Notice

This repository is constantly updating 🤗 ...
> You can directly click on the title to jump to the corresponding PDF link location

## ⚙️ Project
1. [**kvpress.**](https://github.com/NVIDIA/kvpress) *NVIDIA.* [![GitHub Repo stars](https://img.shields.io/github/stars/NVIDIA/kvpress)](https://github.com/NVIDIA/kvpress)
* This repository implements multiple KV cache pruning methods and benchmarks using 🤗 transformers.

2. [**KVCache-Factory.**](https://github.com/Zefan-Cai/KVCache-Factory) [![GitHub Repo stars](https://img.shields.io/github/stars/Zefan-Cai/KVCache-Factory)](https://github.com/Zefan-Cai/KVCache-Factory)
* Unified KV Cache Compression Methods for Auto-Regressive Models.
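For orientation, this is roughly how kvpress is driven in its own documentation: you pick a "press" (one compression strategy among many) and pass it to a custom text-generation pipeline. The class and pipeline names below are recalled from the kvpress README and may have changed, so treat this as an unverified sketch and check the repository for the current API:

```python
from transformers import pipeline
from kvpress import ExpectedAttentionPress  # one of several interchangeable presses

pipe = pipeline(
    "kv-press-text-generation",             # custom pipeline registered by kvpress
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    device="cuda",
    torch_dtype="auto",
)

context = "Long document whose KV cache we want to shrink ..."
question = "What does the document conclude?"

press = ExpectedAttentionPress(compression_ratio=0.5)  # drop ~50% of the cached KV pairs
answer = pipe(context, question=question, press=press)["answer"]
```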
## 📷 Survey

1. [**Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption.**](https://arxiv.org/abs/2407.18003) *Shi Luohe, Zhang Hongyi, Yao Yao, Li Zuchao, Zhao Hai.* COLM 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/zcli-charlie/Awesome-KV-Cache)](https://github.com/zcli-charlie/Awesome-KV-Cache)

2. [**Prompt Compression for Large Language Models: A Survey.**](https://arxiv.org/abs/2410.12388) *Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier.* Arxiv 2024.

3. [**A Survey on Large Language Model Acceleration based on KV Cache Management.**](https://arxiv.org/abs/2412.19442) *Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/TreeAI-Lab/Awesome-KV-Cache-Management)](https://github.com/TreeAI-Lab/Awesome-KV-Cache-Management)

## 🔍 Method

### 1️⃣ Pruning / Evicting / Sparse

1. [**Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time.**](https://arxiv.org/abs/2305.17118) *Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, Anshumali Shrivastava.* NeurIPS 2023.

2. [**SnapKV: LLM Knows What You are Looking for Before Generation.**](https://arxiv.org/abs/2404.14469) *Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/FasterDecoding/SnapKV)](https://github.com/FasterDecoding/SnapKV)

3. [**H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.**](https://arxiv.org/abs/2306.14048) *Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, Beidi Chen.* NeurIPS 2023. [![GitHub Repo stars](https://img.shields.io/github/stars/FMInference/H2O)](https://github.com/FMInference/H2O)

4. [**Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs.**](https://arxiv.org/abs/2310.01801) *Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao.* ICLR 2024.

5. [**PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference.**](https://arxiv.org/abs/2405.12532) *Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, Hai Zhao.* ACL 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/mutonix/pyramidinfer)](https://github.com/mutonix/pyramidinfer)

6. [**PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling.**](https://arxiv.org/abs/2406.02069) *Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, Wen Xiao.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/Zefan-Cai/PyramidKV)](https://github.com/Zefan-Cai/PyramidKV)

7. [**Transformers are Multi-State RNNs.**](https://arxiv.org/abs/2401.06104) *Matanel Oren, Michael Hassid, Nir Yarden, Yossi Adi, Roy Schwartz.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/schwartz-lab-NLP/TOVA)](https://github.com/schwartz-lab-NLP/TOVA)

8. [**Efficient Streaming Language Models with Attention Sinks.**](https://arxiv.org/abs/2309.17453) *Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis.* ICLR 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/mit-han-lab/streaming-llm)](https://github.com/mit-han-lab/streaming-llm)

9. [**A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression.**](https://arxiv.org/abs/2406.11430) *Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini.* EMNLP 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/alessiodevoto/l2compress)](https://github.com/alessiodevoto/l2compress)

10. [**Retrieval Head Mechanistically Explains Long-Context Factuality.**](https://arxiv.org/abs/2404.15574) *Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, Yao Fu.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/nightdessert/Retrieval_Head)](https://github.com/nightdessert/Retrieval_Head)

11. [**Efficient Sparse Attention needs Adaptive Token Release.**](https://arxiv.org/abs/2407.02328) *Chaoran Zhang, Lixin Zou, Dan Luo, Min Tang, Xiangyang Luo, Zihao Li, Chenliang Li.* ACL 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/WHUIR/ADORE)](https://github.com/WHUIR/ADORE)

12. [**Loki: Low-Rank Keys for Efficient Sparse Attention.**](https://arxiv.org/abs/2406.02542) *Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele.* Arxiv 2024.

13. [**Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference.**](https://arxiv.org/abs/2402.09398) *Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, Beidi Chen.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/hdong920/LESS)](https://github.com/hdong920/LESS)
14. [**ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching.**](https://arxiv.org/abs/2403.17312) *Youpeng Zhao, Di Wu, Jun Wang.* ISCA 2024.

15. [**Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference.**](https://arxiv.org/abs/2403.09054) *Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/d-matrix-ai/keyformer-llm)](https://github.com/d-matrix-ai/keyformer-llm)

16. [**Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference.**](https://arxiv.org/abs/2407.11550) *Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/FFY0/AdaKV)](https://github.com/FFY0/AdaKV)

17. [**Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters.**](https://arxiv.org/abs/2406.12335) *Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/guozhiyu/vatp)](https://github.com/guozhiyu/vatp)

18. [**On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference.**](https://arxiv.org/abs/2406.12335) *Siyu Ren, Kenny Q. Zhu.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/DRSY/EasyKV)](https://github.com/DRSY/EasyKV)

19. [**CORM: Cache Optimization with Recent Message for Large Language Model Inference.**](https://arxiv.org/abs/2404.15949) *Jincheng Dai, Zhuowei Huang, Haiyun Jiang, Chen Chen, Deng Cai, Wei Bi, Shuming Shi.* Arxiv 2024.

20. [**RazorAttention: Efficient KV Cache Compression Through Retrieval Heads.**](https://www.arxiv.org/abs/2407.15891) *Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Shikuan Hong, Yiwu Yao, Gongyi Wang.* Arxiv 2024.

21. [**A2SF: Accumulative Attention Scoring with Forgetting Factor for Token Pruning in Transformer Decoder.**](https://arxiv.org/abs/2407.20485) *Hyun Rae Jo, Dong Kun Shin.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/Dirac-Notation/A2SF)](https://github.com/Dirac-Notation/A2SF)

22. [**Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference.**](https://arxiv.org/abs/2406.10774) *Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han.* ICML 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/mit-han-lab/Quest)](https://github.com/mit-han-lab/Quest)

23. [**LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference.**](https://arxiv.org/abs/2407.14057) *Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi.* Arxiv 2024.
24. [**NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time.**](https://arxiv.org/abs/2408.03675) *Yilong Chen, Guoxia Wang, Junyuan Shang, Shiyao Cui, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, Dianhai Yu, Hua Wu.* ACL 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/PaddlePaddle/Research)](https://github.com/PaddlePaddle/Research/tree/master/NLP/ACL2024-NACL)

25. [**Post-Training Sparse Attention with Double Sparsity.**](https://arxiv.org/abs/2408.07092) *Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/andy-yang-1/DoubleSparse)](https://github.com/andy-yang-1/DoubleSparse)

26. [**Farewell to Length Extrapolation, a Training-Free Infinite Context with Finite Attention Scope.**](https://www.arxiv.org/abs/2407.15176) *Xiaoran Liu, Qipeng Guo, Yuerong Song, Zhigeng Liu, Kai Lv, Hang Yan, Linlin Li, Qun Liu, Xipeng Qiu.* Arxiv 2024.

27. [**Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference.**](https://arxiv.org/abs/2403.09636) *Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti.* ICML 2024.

28. [**MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention.**](https://arxiv.org/abs/2407.02490) *Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu.* NeurIPS 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/microsoft/MInference)](https://github.com/microsoft/MInference)

29. [**Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers.**](https://arxiv.org/abs/2305.15805) *Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann.* NeurIPS 2023.

30. [**RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval.**](https://arxiv.org/abs/2409.10516) *Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, Lili Qiu.* Arxiv 2024.

31. [**Sirius: Contextual Sparsity with Correction for Efficient LLMs.**](https://www.arxiv.org/abs/2409.03856) *Yang Zhou, Zhuoming Chen, Zhaozhuo Xu, Victoria Lin, Beidi Chen.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/infini-ai-lab/sirius)](https://github.com/infini-ai-lab/sirius)

32. [**Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU.**](https://www.arxiv.org/abs/2409.09086) *Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/infly-ai/INF-MLLM)](https://github.com/infly-ai/INF-MLLM)

33. [**Training-Free Activation Sparsity in Large Language Models.**](https://www.arxiv.org/abs/2408.14690) *James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/FasterDecoding/TEAL)](https://github.com/FasterDecoding/TEAL)
34. [**KVPruner: Structural Pruning for Faster and Memory-Efficient Large Language Models.**](https://www.arxiv.org/abs/2409.11057) *Bo Lv, Quan Zhou, Xuanang Ding, Yan Wang, Zeming Ma.* Arxiv 2024.

35. [**CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs.**](https://www.arxiv.org/abs/2409.12490) *Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie.* Arxiv 2024.

36. [**Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction.**](https://arxiv.org/abs/2409.17422) *Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty.* Arxiv 2024.

37. [**KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head.**](https://arxiv.org/abs/2410.00161) *Isaac Rehg.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/IsaacRe/vllm-kvcompress)](https://github.com/IsaacRe/vllm-kvcompress)

38. [**InfiniPot: Infinite Context Processing on Memory-Constrained LLMs.**](https://arxiv.org/abs/2410.01518) *Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang.* EMNLP 2024.

39. [**Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads.**](https://arxiv.org/abs/2410.01805) *Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/huangyuxiang03/Locret)](https://github.com/huangyuxiang03/Locret)

40. [**SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference.**](https://arxiv.org/abs/2410.04417) *Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/Gumpest/SparseVLMs)](https://github.com/Gumpest/SparseVLMs)

41. [**LoCoCo: Dropping In Convolutions for Long Context Compression.**](https://arxiv.org/abs/2406.05317) *Ruisi Cai, Yuandong Tian, Zhangyang Wang, Beidi Chen.* ICML 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/VITA-Group/LoCoCo)](https://github.com/VITA-Group/LoCoCo)

42. [**DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads.**](https://arxiv.org/abs/2410.10819) *Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/mit-han-lab/duo-attention)](https://github.com/mit-han-lab/duo-attention)

43. [**SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction.**](https://arxiv.org/abs/2410.13846) *Xuan Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao, Min Lin.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/sail-sg/SimLayerKV)](https://github.com/sail-sg/SimLayerKV)
44. [**In-context KV-Cache Eviction for LLMs via Attention-Gate.**](https://arxiv.org/abs/2410.12876) *Zihao Zeng, Bokai Lin, Tianqi Hou, Hao Zhang, Zhijie Deng.* Arxiv 2024.

45. [**CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving.**](https://arxiv.org/abs/2310.07240) *Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang.* ACM SIGCOMM 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/LMCache/LMCache)](https://github.com/LMCache/LMCache)

46. [**MagicPIG: LSH Sampling for Efficient LLM Generation.**](https://arxiv.org/abs/2410.16179) *Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/Infini-AI-Lab/MagicPIG)](https://github.com/Infini-AI-Lab/MagicPIG)

47. [**TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention.**](https://arxiv.org/abs/2410.05076) *Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/DerrickYLJ/TidalDecode)](https://github.com/DerrickYLJ/TidalDecode)

48. [**ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference.**](https://arxiv.org/abs/2410.21465) *Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/bytedance/ShadowKV)](https://github.com/bytedance/ShadowKV)

49. [**BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference.**](https://arxiv.org/abs/2410.23079) *Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/JunqiZhao888/buzz-llm)](https://github.com/JunqiZhao888/buzz-llm)

50. [**CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling.**](https://arxiv.org/abs/2406.12018) *Yu Bai, Xiyuan Zou, Heyan Huang, Sanxing Chen, Marc-Antoine Rondeau, Yang Gao, Jackie Chi Kit Cheung.* EMNLP 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/ybai-nlp/CItruS)](https://github.com/ybai-nlp/CItruS)

51. [**TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection.**](https://arxiv.org/abs/2411.02886) *Wei Wu, Zhuoshi Pan, Chao Wang, Liyi Chen, Yunchu Bai, Kun Fu, Zheng Wang, Hui Xiong.* Arxiv 2024.

52. [**Recycled Attention: Efficient inference for long-context language models.**](https://arxiv.org/abs/2411.05787) *Fangyuan Xu, Tanya Goyal, Eunsol Choi.* Arxiv 2024.
53. [**VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration.**](https://arxiv.org/abs/2410.23317) *Dezhan Tu, Danylo Vashchilenko, Yuzhe Lu, Panpan Xu.* Arxiv 2024.

54. [**Squeezed Attention: Accelerating Long Context Length LLM Inference.**](https://arxiv.org/abs/2411.09688) *Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, June Paik, Michael W. Mahoney, Kurt Keutzer, Amir Gholami.* Arxiv 2024.

55. [**ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction.**](https://github.com/pku-liang/ArkVale/blob/main/media/arkvale-nips24-paper.pdf) *Renze Chen, Zhuofeng Wang, Beiquan Cao, Tong Wu, Size Zheng, Xiuhong Li, Xuechao Wei, Shengen Yan, Meng Li, Yun Liang.* NeurIPS 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/pku-liang/ArkVale)](https://github.com/pku-liang/ArkVale)

56. [**Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning.**](https://arxiv.org/abs/2410.19258) *Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, Wen Xiao.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/FYYFU/HeadKV)](https://github.com/FYYFU/HeadKV)

57. [**[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster.**](https://arxiv.org/pdf/2412.01818) *Qizhe Zhang, Aosong Cheng, Ming Lu, Zhiyong Zhuo, Minqi Wang, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/Theia-4869/FasterVLM)](https://github.com/Theia-4869/FasterVLM)

58. [**Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models.**](https://arxiv.org/abs/2409.10197) *Weihao Ye, Qiong Wu, Wenhao Lin, Yiyi Zhou.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/ywh187/FitPrune)](https://github.com/ywh187/FitPrune)

59. [**ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression.**](https://arxiv.org/abs/2412.03213) *Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, Minyi Guo.* Arxiv 2024.

60. [**Unifying KV Cache Compression for Large Language Models with LeanKV.**](https://arxiv.org/abs/2412.03131) *Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen.* Arxiv 2024.

61. [**DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs.**](https://arxiv.org/abs/2412.14838) *Xiabin Zhou, Wenbin Wang, Minyan Zeng, Jiaxian Guo, Xuebo Liu, Li Shen, Min Zhang, Liang Ding.* Arxiv 2024.

62. [**SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation.**](https://arxiv.org/abs/2412.13649) *Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, Deyu Zhou.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/Linking-ai/SCOPE)](https://github.com/Linking-ai/SCOPE)
63. [**HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing.**](https://arxiv.org/abs/2412.16187) *Minghui Liu, Tahseen Rabbani, Tony O'Halloran, Ananth Sankaralingam, Mary-Anne Hartley, Brian Gravelle, Furong Huang, Cornelia Fermüller, Yiannis Aloimonos.* Arxiv 2024.

64. [**SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator.**](https://arxiv.org/abs/2412.12094) *Guoxuan Chen, Han Shi, Jiawei Li, Yihang Gao, Xiaozhe Ren, Yimeng Chen, Xin Jiang, Zhenguo Li, Weiyang Liu, Chao Huang.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/HKUDS/SepLLM)](https://github.com/HKUDS/SepLLM)

65. [**MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache.**](https://arxiv.org/abs/2411.18077) *Akshat Sharma, Hangliang Ding, Jianping Li, Neel Dani, Minjia Zhang.* Arxiv 2025.

66. [**FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation.**](https://arxiv.org/abs/2502.01068) *Dongwon Jo, Jiwon Song, Yulhwa Kim, Jae-Joon Kim.* Arxiv 2025. [![GitHub Repo stars](https://img.shields.io/github/stars/dongwonjo/FastKV)](https://github.com/dongwonjo/FastKV)

67. [**ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference.**](https://arxiv.org/abs/2502.00299) *Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Bo Li, Xuming Hu, Xiaowen Chu.* Arxiv 2025.

68. [**LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention.**](https://arxiv.org/abs/2502.14866) *Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han.* MLSys 2025. [![GitHub Repo stars](https://img.shields.io/github/stars/mit-han-lab/omniserve)](https://github.com/mit-han-lab/omniserve)

69. [**RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression.**](https://www.arxiv.org/abs/2502.14051) *Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov.* Arxiv 2025.

70. [**Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification.**](https://arxiv.org/abs/2412.00876) *Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaosheng Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin.* ICLR 2025. [![GitHub Repo stars](https://img.shields.io/github/stars/Osilly/dynamic_llava)](https://github.com/Osilly/dynamic_llava)

71. [**DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance.**](https://arxiv.org/abs/2502.16886) *Xuanfan Ni, Liyan Xu, Chenyang Lyu, Longyue Wang, Mo Yu, Lemao Liu, Fandong Meng, Jie Zhou, Piji Li.* Arxiv 2025.

72. [**Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs.**](https://arxiv.org/abs/2503.00979) *Ravi Ghadia, Avinash Kumar, Gaurav Jain, Prashant Nair, Poulami Das.* ICML 2025. [![GitHub Repo stars](https://img.shields.io/github/stars/ghadiaravi13/MorphKV)](https://github.com/ghadiaravi13/MorphKV)
73. [**MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference.**](https://arxiv.org/abs/2502.17599) *Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, Mi Zhang.* NAACL 2025. [![GitHub Repo stars](https://img.shields.io/github/stars/AIoT-MLSys-Lab/MEDA)](https://github.com/AIoT-MLSys-Lab/MEDA)

74. [**KVCrush: Key value cache size-reduction using similarity in head-behaviour.**](https://arxiv.org/abs/2503.00022) *Gopi Krishna Jha, Sameh Gobriel, Liubov Talamanova, Alexander Kozlov, Nilesh Jain.* Arxiv 2025.

75. [**LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference.**](https://arxiv.org/abs/2503.08879) *Guangtao Wang, Shubhangi Upasani, Chen Wu, Darshan Gandhi, Jonathan Li, Changran Hu, Bo Li, Urmish Thakker.* ICLR 2025.

76. [**SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs.**](https://arxiv.org/abs/2503.16163) *Shibo Jie, Yehui Tang, Kai Han, Zhi-Hong Deng, Jing Han.* Arxiv 2025.

77. [**KV-Distill: Nearly Lossless Learnable Context Compression for LLMs.**](https://arxiv.org/abs/2503.10337) *Vivek Chari, Guanghui Qin, Benjamin Van Durme.* Arxiv 2025. [![GitHub Repo stars](https://img.shields.io/github/stars/vnchari/kv-distill)](https://github.com/vnchari/kv-distill)

78. [**SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching.**](https://arxiv.org/abs/2504.00970) *Yuxuan Zhu, Ali Falahati, David H. Yang, Mohammad Mohammadi Amiri.* Arxiv 2025.

79. [**The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs.**](https://arxiv.org/abs/2504.17768) *Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, Edoardo M. Ponti.* Arxiv 2025.

80. [**FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension.**](https://arxiv.org/abs/2505.00570) *Jushi Kai, Boyi Zeng, Yixuan Wang, Haoli Bai, Bo Jiang, Zhouhan Lin.* Arxiv 2025.

81. [**Mustafar: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference.**](https://arxiv.org/abs/2505.22913) *Donghyeon Joo, Helya Hosseini, Ramyad Hadidi, Bahar Asgari.* Arxiv 2025. [![GitHub Repo stars](https://img.shields.io/github/stars/dhjoo98/mustafar)](https://github.com/dhjoo98/mustafar)

82. [**CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences.**](https://arxiv.org/abs/2503.12491) *Ziran Qin, Yuchen Cao, Mingbao Lin, Wen Hu, Shixuan Fan, Ke Cheng, Weiyao Lin, Jianguo Li.* ICLR 2025. [![GitHub Repo stars](https://img.shields.io/github/stars/antgroup/cakekv)](https://github.com/antgroup/cakekv)

83. [**R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration.**](https://arxiv.org/abs/2505.24133) *Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, Zhen Dong, Anima Anandkumar, Abedelkadir Asi, Junjie Hu.* Arxiv 2025. [![GitHub Repo stars](https://img.shields.io/github/stars/Zefan-Cai/R-KV)](https://github.com/Zefan-Cai/R-KV)
84. [**KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction.**](https://arxiv.org/abs/2505.23416) *Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song.* Arxiv 2025. [![GitHub Repo stars](https://img.shields.io/github/stars/snu-mllab/KVzip)](https://github.com/snu-mllab/KVzip)

85. [**Homogeneous Keys, Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLMs.**](https://www.arxiv.org/abs/2506.05410) *Wanyun Cui, Mingwei Xu.* Arxiv 2025.

86. [**AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models.**](https://www.arxiv.org/abs/2506.03762) *Yifeng Gu, Zicong Jiang, Jianxiu Jin, Kailing Guo, Ziyang Zhang, Xiangmin Xu.* Arxiv 2025.

87. [**Inference-Time Hyper-Scaling with KV Cache Compression.**](https://arxiv.org/abs/2506.05345) *Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, Edoardo M. Ponti.* Arxiv 2025.

88. [**InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding.**](https://arxiv.org/abs/2506.15745v1) *Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang.* Arxiv 2025.

89. [**Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs.**](https://arxiv.org/abs/2506.17121) *Adithya Bhaskar, Alexander Wettig, Tianyu Gao, Yihe Dong, Danqi Chen.* Arxiv 2025. [![GitHub Repo stars](https://img.shields.io/github/stars/princeton-pli/PruLong)](https://github.com/princeton-pli/PruLong)

90. [**Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores.**](https://arxiv.org/abs/2507.08143) *Vivek Chari, Benjamin Van Durme.* Arxiv 2025.

91. [**LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models.**](https://arxiv.org/abs/2507.14204) *Dachuan Shi, Yonggan Fu, Xiangchi Yuan, Zhongzhi Yu, Haoran You, Sixu Li, Xin Dong, Jan Kautz, Pavlo Molchanov, Yingyan (Celine) Lin.* ICML 2025. [![GitHub Repo stars](https://img.shields.io/github/stars/GATECH-EIC/LaCache)](https://github.com/GATECH-EIC/LaCache)

92. [**EvolKV: Evolutionary KV Cache Compression for LLM Inference.**](https://arxiv.org/abs/2509.08315) *Bohan Yu, Yekun Chai.* Arxiv 2025.

93. [**LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation.**](https://arxiv.org/pdf/2509.09754) *Yiqun Shen, Song Yuan, Zhengze Zhang, Xiaoliang Wang, Daxin Jiang, Nguyen Cam-Tu.* Arxiv 2025. [![GitHub Repo stars](https://img.shields.io/github/stars/MGDDestiny/Lava)](https://github.com/MGDDestiny/Lava)

94. [**KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments.**](https://arxiv.org/abs/2504.15364) *Junyoung Park, Dalton Jones, Matthew J Morse, Raghavv Goel, Mingu Lee, Chris Lott.* NeurIPS 2025.

95. [**CAOTE: KV Cache Selection for LLMs via Attention Output Error-Based Token Eviction.**](https://arxiv.org/abs/2504.14051) *Raghavv Goel, Junyoung Park, Mukul Gagrani, Dalton Jones, Matthew Morse, Harper Langston, Mingu Lee, Chris Lott.* Arxiv 2025.
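Despite their differences, most entries above share one primitive: score each cached token, protect a few positions (attention sinks, the recent window), keep the highest-scoring remainder, and evict the rest. A minimal H2O/Scissorhands-flavored sketch with hypothetical tensor shapes (illustrative only, not the reference implementation of any paper listed here):

```python
import torch

def evict_kv(keys, values, attn_weights, budget, sink=4, recent=64):
    """keys/values: [heads, seq, dim]; attn_weights: [heads, queries, seq].
    Assumes budget >= sink + recent so the protected positions fit."""
    seq = keys.shape[1]
    if seq <= budget:
        return keys, values
    score = attn_weights.sum(dim=(0, 1))             # accumulated attention per position
    score[:sink] = float("inf")                      # always keep the attention sinks
    score[-recent:] = float("inf")                   # always keep the recent window
    keep = score.topk(budget).indices.sort().values  # preserve positional order
    return keys[:, keep], values[:, keep]
```

The papers differ mainly in how `score` is computed (attention mass, key norms, value norms, learned retaining heads, hashes) and at what granularity the `budget` is allocated (per head, per layer, per task).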
### 2️⃣ Merging

1. [**D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models.**](https://arxiv.org/abs/2406.13035) *Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Mi Zhang.* ICLR 2025. [![GitHub Repo stars](https://img.shields.io/github/stars/AIoT-MLSys-Lab/d2o)](https://github.com/AIoT-MLSys-Lab/d2o)

2. [**Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks.**](https://arxiv.org/abs/2407.08454) *Zheng Wang, Boxiao Jin, Zhongzhi Yu, Minjia Zhang.* Arxiv 2024.

3. [**CaM: Cache Merging for Memory-efficient LLMs Inference.**](https://openreview.net/forum?id=LCTmppB165) *Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, Rongrong Ji.* ICML 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/zyxxmu/cam)](https://github.com/zyxxmu/cam)

4. [**Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs.**](https://arxiv.org/abs/2404.10308) *Woomin Song, Seunghyuk Oh, Sangwoo Mo, Jaehyung Kim, Sukmin Yun, Jung-Woo Ha, Jinwoo Shin.* ICLR 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/alinlab/HOMER)](https://github.com/alinlab/HOMER)

5. [**Token Merging: Your ViT But Faster.**](https://arxiv.org/abs/2210.09461) *Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman.* ICLR 2023. [![GitHub Repo stars](https://img.shields.io/github/stars/facebookresearch/ToMe)](https://github.com/facebookresearch/ToMe)

6. [**LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference.**](https://arxiv.org/abs/2406.18139) *Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan.* EMNLP 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/SUSTechBruce/LOOK-M)](https://github.com/SUSTechBruce/LOOK-M)

7. [**Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention.**](https://arxiv.org/abs/2404.07143) *Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal.* Arxiv 2024.

8. [**Compressed Context Memory for Online Language Model Interaction.**](https://arxiv.org/abs/2312.03414) *Jang-Hyun Kim, Junyoung Yeom, Sangdoo Yun, Hyun Oh Song.* ICLR 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/snu-mllab/Context-Memory)](https://github.com/snu-mllab/Context-Memory)

9. [**CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion.**](https://arxiv.org/abs/2405.16444) *Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang.* EuroSys 2025. [![GitHub Repo stars](https://img.shields.io/github/stars/LMCache/LMCache)](https://github.com/LMCache/LMCache)

10. [**AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning.**](https://arxiv.org/abs/2412.03248) *Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/LaVi-Lab/AIM)](https://github.com/LaVi-Lab/AIM)

11. [**KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference.**](https://arxiv.org/abs/2504.09936) *Yuxuan Tian, Zihan Wang, Yebo Peng, Aomufei Yuan, Zhiming Wang, Bairen Yi, Xin Liu, Yong Cui, Tong Yang.* Arxiv 2025.
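Where eviction discards tokens outright, merging folds the states of would-be-evicted tokens into retained neighbors, so their information is averaged rather than lost. A toy value-merging sketch under hypothetical shapes (in the spirit of CaM/D2O, not their actual algorithms; `keep_idx`/`drop_idx` would come from an eviction policy like the one sketched above):

```python
import torch

def merge_values(keys, values, keep_idx, drop_idx):
    """Fold each dropped token's value into its most similar kept token (per head),
    replacing the kept value with the mean of everything merged into it."""
    kept_k, kept_v = keys[:, keep_idx], values[:, keep_idx]
    drop_k, drop_v = keys[:, drop_idx], values[:, drop_idx]
    sim = torch.einsum("htd,hsd->hts", drop_k, kept_k)  # key similarity [heads, drop, keep]
    nearest = sim.argmax(-1)                            # nearest kept slot per dropped token
    merged = kept_v.scatter_add(1, nearest.unsqueeze(-1).expand_as(drop_v), drop_v)
    counts = torch.ones_like(kept_v[..., :1]).scatter_add(
        1, nearest.unsqueeze(-1), torch.ones_like(drop_v[..., :1]))
    return kept_k, merged / counts                      # mean over each merge group
```

The real methods are more careful than a plain mean: KeepKV, for instance, is specifically about compensating the output perturbation that naive merging introduces.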
### 3️⃣ Cross-Layer

1. [**You Only Cache Once: Decoder-Decoder Architectures for Language Models.**](https://arxiv.org/abs/2405.05254) *Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei.* NeurIPS 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/microsoft/unilm)](https://github.com/microsoft/unilm/tree/master/YOCO)

2. [**Reducing Transformer Key-Value Cache Size with Cross-Layer Attention.**](https://arxiv.org/abs/2405.12981) *William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly.* Arxiv 2024.

3. [**Layer-Condensed KV Cache for Efficient Inference of Large Language Models.**](https://arxiv.org/abs/2405.10637) *Haoyi Wu, Kewei Tu.* ACL 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/whyNLP/LCKV)](https://github.com/whyNLP/LCKV)

4. [**MiniCache: KV Cache Compression in Depth Dimension for Large Language Models.**](https://arxiv.org/abs/2405.14366) *Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/AkideLiu/MiniCache)](https://github.com/AkideLiu/MiniCache)

5. [**MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding.**](https://arxiv.org/abs/2406.09297) *Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/zaydzuhri/pythia-mlkv)](https://github.com/zaydzuhri/pythia-mlkv)

6. [**A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference.**](https://arxiv.org/abs/2410.14442) *You Wu, Haoyi Wu, Kewei Tu.* Arxiv 2024.

7. [**KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing.**](https://arxiv.org/abs/2410.18517) *Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/yangyifei729/KVSharer)](https://github.com/yangyifei729/KVSharer)

8. [**SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation.**](https://arxiv.org/abs/2410.03960) *Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, Yuxiong He.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/snowflakedb/arctictraining)](https://github.com/snowflakedb/arctictraining)

9. [**Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity.**](https://arxiv.org/abs/2412.02252) *Da Ma, Lu Chen, Situo Zhang, Yuxun Miao, Su Zhu, Zhi Chen, Hongshen Xu, Hanqi Li, Shuai Fan, Lei Pan, Kai Yu.* Arxiv 2024.
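The shared idea in this subsection is to stop giving every layer its own K/V: groups of adjacent layers (or, in YOCO's decoder-decoder extreme, all layers) read one shared cache, dividing cache size by the group width. A schematic module, hypothetical and CLA-flavored rather than any paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKVAttention(nn.Module):
    """Only the first layer of each group projects (and caches) K/V;
    the other layers in the group attend over that shared tensor pair."""
    def __init__(self, d_model: int, produces_kv: bool):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.produces_kv = produces_kv
        if produces_kv:  # group leaders are the only owners of K/V projections
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x, shared_kv=None):
        q = self.q_proj(x)
        if self.produces_kv:
            shared_kv = (self.k_proj(x), self.v_proj(x))  # the only K/V this group caches
        k, v = shared_kv
        return F.scaled_dot_product_attention(q, k, v), shared_kv

# Width-2 groups: layers 0 and 2 produce K/V; layers 1 and 3 reuse them.
layers = [SharedKVAttention(64, produces_kv=(i % 2 == 0)) for i in range(4)]
x, kv = torch.randn(1, 16, 64), None
for layer in layers:
    x, kv = layer(x, kv)
```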
### 4️⃣ Low-Rank

1. [**Fast Transformer Decoding: One Write-Head is All You Need.**](https://arxiv.org/abs/1911.02150) *Noam Shazeer.* Arxiv 2019.

2. [**GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.**](https://arxiv.org/abs/2305.13245) *Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai.* EMNLP 2023.

3. [**DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.**](https://arxiv.org/abs/2405.04434) *DeepSeek-AI.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/deepseek-ai/DeepSeek-V2)](https://github.com/deepseek-ai/DeepSeek-V2)

4. [**Effectively Compress KV Heads for LLM.**](https://arxiv.org/abs/2406.07056) *Hao Yu, Zelan Yang, Shen Li, Yong Li, Jianxin Wu.* Arxiv 2024.

5. [**Palu: Compressing KV-Cache with Low-Rank Projection.**](https://arxiv.org/abs/2407.21118) *Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Kai-Chiang Wu.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/shadowpa0327/Palu)](https://github.com/shadowpa0327/Palu)

6. [**LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy.**](https://arxiv.org/abs/2410.03111) *Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen.* Arxiv 2024.

7. [**Tensor Product Attention Is All You Need.**](https://arxiv.org/abs/2501.06425) *Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao.* Arxiv 2025. [![GitHub Repo stars](https://img.shields.io/github/stars/tensorgi/T6)](https://github.com/tensorgi/T6)

8. [**ThinK: Thinner Key Cache by Query-Driven Pruning.**](https://arxiv.org/abs/2407.21018) *Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/SalesforceAIResearch/ThinK)](https://github.com/SalesforceAIResearch/ThinK)

9. [**Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache.**](https://arxiv.org/abs/2506.11886) *Xiaoran Liu, Siyang He, Qiqi Wang, Ruixiao Li, Yuerong Song, Zhigeng Liu, Linlin Li, Qun Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu.* Arxiv 2025.

10. [**OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule.**](https://www.arxiv.org/abs/2509.21623) *Yuxuan Zhu, David H. Yang, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Pin-Yu Chen.* Arxiv 2025.
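Several of these methods exploit the observation that cached keys and values are approximately low-rank: store a down-projected cache and fold the reconstruction back into the attention matmuls. A minimal offline SVD factorization (illustrative only; methods like Palu do this per-head, with calibrated ranks and fused kernels):

```python
import torch

def lowrank_factor(kv, rank):
    """kv: [seq, dim]. Returns a rank-r cache [seq, rank] and an up-projection [rank, dim];
    the up-projection can be folded into the attention weights offline."""
    U, S, Vh = torch.linalg.svd(kv, full_matrices=False)
    cache = U[:, :rank] * S[:rank]  # this is what gets stored instead of kv
    return cache, Vh[:rank]

kv = torch.randn(4096, 128)
cache, up = lowrank_factor(kv, rank=32)       # 4x smaller cache
print((cache @ up - kv).norm() / kv.norm())   # relative reconstruction error
```

GQA/MQA and MLA (DeepSeek-V2) can be read as the trained counterpart of the same idea: bake the low-rank (or head-shared) structure into the architecture so no post-hoc factorization is needed.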
### 5️⃣ Quantization

1. [**ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification.**](https://www.arxiv.org/abs/2405.14256) *Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang.* Arxiv 2024.

2. [**No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization.**](https://arxiv.org/abs/2402.18096) *June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee.* Arxiv 2024.

3. [**KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache.**](https://arxiv.org/abs/2402.02750) *Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu.* ICML 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/jy-yuan/KIVI)](https://github.com/jy-yuan/KIVI)

4. [**GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM.**](https://arxiv.org/abs/2403.05527) *Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/opengear-project/GEAR)](https://github.com/opengear-project/GEAR)

5. [**PQCache: Product Quantization-based KVCache for Long Context LLM Inference.**](https://arxiv.org/abs/2407.12820) *Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, Bin Cui.* Arxiv 2024.

6. [**Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression.**](https://arxiv.org/abs/2405.12591) *Peiyu Liu, Ze-Feng Gao, Wayne Xin Zhao, Yipeng Ma, Tao Wang, Ji-Rong Wen.* Arxiv 2024.

7. [**SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models.**](https://arxiv.org/abs/2405.06219) *Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/cat538/SKVQ)](https://github.com/cat538/SKVQ)

8. [**QAQ: Quality Adaptive Quantization for LLM KV Cache.**](https://arxiv.org/abs/2403.04643) *Shichen Dong, Wen Cheng, Jiayu Qin, Wei Wang.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/ClubieDong/QAQ-KVCacheQuantization)](https://github.com/ClubieDong/QAQ-KVCacheQuantization)

9. [**KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization.**](https://arxiv.org/abs/2401.18079) *Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami.* NeurIPS 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/SqueezeAILab/KVQuant)](https://github.com/SqueezeAILab/KVQuant)

10. [**WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More.**](https://arxiv.org/abs/2402.12065) *Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie.* Arxiv 2024.

11. [**KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference.**](https://arxiv.org/abs/2502.04420) *Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu, Yiwu Yao, Sinno Jialin Pan, Mingxuan Yuan.* ICML 2025. [![GitHub Repo stars](https://img.shields.io/github/stars/cmd2001/KVTuner)](https://github.com/cmd2001/KVTuner)

12. [**Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models.**](https://arxiv.org/abs/2503.16257) *Keda Tao, Haoxuan You, Yang Sui, Can Qin, Huan Wang.* Arxiv 2025. [![GitHub Repo stars](https://img.shields.io/github/stars/KD-TAO/VidKV)](https://github.com/KD-TAO/VidKV)

13. [**BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache.**](https://arxiv.org/abs/2503.18773) *Dayou Du, Shijie Cao, Jianyi Cheng, Ting Cao, Mao Yang.* Arxiv 2025. [![GitHub Repo stars](https://img.shields.io/github/stars/DD-DuDa/BitDecoding)](https://github.com/DD-DuDa/BitDecoding)

14. [**TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering.**](https://www.arxiv.org/abs/2506.04642) *Vinay Joshi, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum.* ACL 2025 Industry Track.

15. [**TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization.**](https://arxiv.org/abs/2505.19586) *Dingyu Yao, Bowen Shen, Zheng Lin, Wei Liu, Jian Luan, Bin Wang, Weiping Wang.* ACL 2025. [![GitHub Repo stars](https://img.shields.io/github/stars/ydyhello/TailorKV)](https://github.com/ydyhello/TailorKV)

16. [**CommVQ: Commutative Vector Quantization for KV Cache Compression.**](https://arxiv.org/abs/2506.18879) *Junyan Li, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Tianle Cai, Zhile Ren, Pengsheng Guo, Foroozan Karimzadeh, Colorado Reed, Chong Wang, Chuang Gan.* ICML 2025.
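The recurring recipe here is uniform low-bit quantization with the grouping axis chosen where outliers concentrate; KIVI, for instance, argues for per-channel keys and per-token values. A minimal asymmetric round-trip (illustrative only, not any paper's kernel):

```python
import torch

def quantize(x, bits=2, dim=-1):
    """Asymmetric uniform quantization along `dim`; returns codes plus scale/offset."""
    lo, hi = x.amin(dim, keepdim=True), x.amax(dim, keepdim=True)
    scale = (hi - lo) / (2**bits - 1)
    q = ((x - lo) / scale).round().clamp(0, 2**bits - 1)
    return q.to(torch.uint8), scale, lo

def dequantize(q, scale, lo):
    return q.float() * scale + lo

k = torch.randn(4096, 128)            # [tokens, channels]
q, s, z = quantize(k, bits=2, dim=0)  # per-channel grouping for keys, KIVI-style
print((dequantize(q, s, z) - k).abs().mean())
```

The papers then differ in how they rescue the quality lost at 2 bits: keeping salient tokens in full precision (ZipCache), adding a low-rank residual (GEAR), mixed precision per layer (KVTuner), or vector quantization (PQCache, CommVQ).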
### 6️⃣ Prompt Compression

1. [**LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models.**](https://arxiv.org/abs/2310.05736) *Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu.* EMNLP 2023. [![GitHub Repo stars](https://img.shields.io/github/stars/microsoft/LLMLingua)](https://github.com/microsoft/LLMLingua)

2. [**LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression.**](https://arxiv.org/abs/2403.12968) *Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, Dongmei Zhang.* ACL 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/microsoft/LLMLingua)](https://github.com/microsoft/LLMLingua)

3. [**LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression.**](https://arxiv.org/abs/2310.06839) *Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu.* ACL 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/microsoft/LLMLingua)](https://github.com/microsoft/LLMLingua)

4. [**TACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning.**](https://arxiv.org/abs/2409.13035) *Shivam Shandilya, Menglin Xia, Supriyo Ghosh, Huiqiang Jiang, Jue Zhang, Qianhui Wu, Victor Rühle.* Arxiv 2024.

5. [**ICPC: In-context Prompt Compression with Faster Inference.**](https://arxiv.org/abs/2501.01625) *Ziyang Yu, Yuyu Liu.* Arxiv 2025.

6. [**Better Prompt Compression Without Multi-Layer Perceptrons.**](https://arxiv.org/abs/2501.06730) *Edouardo Honig, Andrew Lizarraga, Zijun Frank Zhang, Ying Nian Wu.* Arxiv 2025.
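Prompt compression attacks the problem upstream: shorten the input before the KV cache is ever built. The snippet below follows the usage shown in LLMLingua's published examples as recalled from its README (`PromptCompressor`, `compress_prompt`, and `target_token` are the library's names, but verify against the repository before relying on them):

```python
from llmlingua import PromptCompressor

context = ["Demonstration 1 ...", "Demonstration 2 ..."]
instruction = "Answer using the demonstrations."
question = "Which demonstration is most relevant?"

compressor = PromptCompressor()      # downloads a small compressor LM by default
result = compressor.compress_prompt(
    context,                         # documents/demonstrations to compress
    instruction=instruction,
    question=question,
    target_token=2000,               # requested compressed length
)
print(result["compressed_prompt"])
```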
[![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FLLMLingua)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLMLingua)\n\n4. [**TACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning.**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.13035) *Shivam Shandilya, Menglin Xia, Supriyo Ghosh, Huiqiang Jiang, Jue Zhang, Qianhui Wu, Victor Rühle.* Arxiv 2024.\n\n5. [**ICPC: In-context Prompt Compression with Faster Inference.**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.01625) *Ziyang Yu, Yuyu Liu.* Arxiv 2025.\n\n6. [**Better Prompt Compression Without Multi-Layer Perceptrons.**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.06730) *Edouardo Honig, Andrew Lizarraga, Zijun Frank Zhang, Ying Nian Wu.* Arxiv 2025.\n\n\n### 7️⃣ Reuse\n\n1. [**KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse.**](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2502.16002) *Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, Shiyu Chang.* Arxiv 2025. [![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FUCSB-NLP-Chang\u002FKVLink)](https:\u002F\u002Fgithub.com\u002FUCSB-NLP-Chang\u002FKVLink)\n\n\n### 8️⃣ Non-Autoregressive\n\n1. [**dKV-Cache: The Cache for Diffusion Language Models.**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15781) *Xinyin Ma, Runpeng Yu, Gongfan Fang, Xinchao Wang.* Arxiv 2025. [![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhorseee\u002FdKV-Cache)](https:\u002F\u002Fgithub.com\u002Fhorseee\u002FdKV-Cache)\n\n2. [**dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching.**](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2506.06295) *Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyan Wei, Shaobo Wang, Linfeng Zhang.* Arxiv 2025. [![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmaomaocun\u002FdLLM-cache)](https:\u002F\u002Fgithub.com\u002Fmaomaocun\u002FdLLM-cache)\n\n3. [**Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding.**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22618) *Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, Enze Xie.* Arxiv 2025. [![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FFast-dLLM)](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FFast-dLLM)\n\n\n## 📊 Evaluation\n\n1. [**KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches.**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.01527) *Jiayi Yuan, Hongyi Liu, Shaochen (Henry) Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu.* EMNLP 2024. [![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhenryzhongsc\u002Flongctx_bench)](https:\u002F\u002Fgithub.com\u002Fhenryzhongsc\u002Flongctx_bench)\n\n2. [**SCBench: A KV Cache-Centric Analysis of Long-Context Methods.**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.10319) *Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu.* Arxiv 2024. [![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FMInference)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMInference\u002Ftree\u002Fmain\u002Fscbench)\n\n3. 
[**More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression.**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.12706) *Jiebin Zhang, Dawei Zhu, Yifan Song, Wenhao Wu, Chuqiao Kuang, Xiaoguang Li, Lifeng Shang, Qun Liu, Sujian Li.* Arxiv 2024. [![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzhzihao\u002FQPruningKV)](https:\u002F\u002Fgithub.com\u002Fzhzihao\u002FQPruningKV)\n\n4. [**Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving.**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.24000) *Wei Gao, Xinyu Zhou, Peng Sun, Tianwei Zhang, Yonggang Wen.* Arxiv 2025. [![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLLMkvsys\u002Frethink-kv-compression)](https:\u002F\u002Fgithub.com\u002FLLMkvsys\u002Frethink-kv-compression)\n\n","\u003Cdiv align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOctober2001_Awesome-KV-Cache-Compression_readme_f634855ab72f.png\" width=90%>\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n   \n[![LICENSE](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002FOctober2001\u002FAwesome-KV-Cache-Compression)](https:\u002F\u002Fgithub.com\u002FOctober2001\u002FAwesome-KV-Cache-Compression\u002Fblob\u002Fmain\u002FLICENSE)\n[![Awesome](https:\u002F\u002Fcdn.rawgit.com\u002Fsindresorhus\u002Fawesome\u002Fd7305f38d29fed78fa85652e3a63e154dd8e8829\u002Fmedia\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fsindresorhus\u002Fawesome)\n[![commit](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flast-commit\u002FOctober2001\u002FAwesome-KV-Cache-Compression?color=blue)](https:\u002F\u002Fgithub.com\u002FOctober2001\u002FAwesome-KV-Cache-Compression\u002Fcommits\u002Fmain)\n[![PR](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPRs-Welcome-red)](https:\u002F\u002Fgithub.com\u002FOctober2001\u002FAwesome-KV-Cache-Compression\u002Fpulls)\n[![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOctober2001\u002FAwesome-KV-Cache-Compression)](https:\u002F\u002Fgithub.com\u002FOctober2001\u002FAwesome-KV-Cache-Compression)\n\n\u003C\u002Fdiv>\n\n## 📢 新闻\n🎉 [2024-07-23] 项目启动 🥳\n\n## 📜 通知\n\n本仓库持续更新中 🤗 ...\n> 您可以直接点击标题跳转至对应的PDF链接位置\n\n## ⚙️ 项目\n1. [**kvpress.**](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress) *NVIDIA.* [![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002Fkvpress)](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress)\n* 本仓库实现了多种KV缓存剪枝方法，并使用🤗 transformers进行基准测试。\n\n2. [**KVCache-Factory.**](https:\u002F\u002Fgithub.com\u002FZefan-Cai\u002FKVCache-Factory) [![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZefan-Cai\u002FKVCache-Factory)](https:\u002F\u002Fgithub.com\u002FZefan-Cai\u002FKVCache-Factory)\n* 面向自回归模型的统一KV缓存压缩方法。\n\n## 📷 综述\n\n1. [**降低成本：LLM KV缓存消耗优化方法综述。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.18003) *罗世和、张洪毅、姚瑶、李祖超、赵海。* COLM 2024。[![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzcli-charlie\u002FAwesome-KV-Cache)](https:\u002F\u002Fgithub.com\u002Fzcli-charlie\u002FAwesome-KV-Cache)\n\n2. [**大型语言模型的提示压缩：综述。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12388) *李宗谦、刘银红、苏一轩、奈杰尔·科利尔。* Arxiv 2024。\n   \n3. 
[**基于KV缓存管理的大型语言模型加速综述。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.19442) *李浩洋、李一鸣、田安欣、唐天浩、徐占超、陈雪佳、妮可·胡、董伟、李青、陈磊。* Arxiv 2024。[![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTreeAI-Lab\u002FAwesome-KV-Cache-Management)](https:\u002F\u002Fgithub.com\u002FTreeAI-Lab\u002FAwesome-KV-Cache-Management)\n\n\n## 🔍 方法\n\n### 1️⃣ 剪枝 \u002F 驱逐 \u002F 稀疏\n\n1. [**剪刀手：利用重要性假设的持久性进行LLM KV缓存测试时压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.17118) *刘子畅、阿迪蒂亚·德赛、廖方硕、王伟涛、维克多·谢、徐兆卓、阿纳斯塔西奥斯·基里利迪斯、安舒马利·施里瓦斯塔瓦。* NeurIPS 2023。 \n\n2. [**SnapKV：LLM在生成前就知道你在找什么。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.14469) *李宇鸿、黄英兵、杨博文、巴拉特·文基特什、阿西尔·洛卡泰利、叶汉晨、蔡天乐、帕特里克·刘易斯、陈德明。* Arxiv 2024。[![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFasterDecoding\u002FSnapKV)](https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FSnapKV)\n\n3. [**H2O：用于高效生成式推理的大型语言模型重击者Oracle。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.14048) *张振宇、盛颖、周天义、陈天龙、郑连敏、蔡瑞思、宋赵、田远东、克里斯托弗·雷、克拉克·巴雷特、王张阳、陈贝蒂。* NeurIPS 2023。[![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFMInference\u002FH2O)](https:\u002F\u002Fgithub.com\u002FFMInference\u002FH2O)\n\n4. [**模型告诉你该丢弃什么：LLM的自适应KV缓存压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01801) *葛素玉、张云楠、刘丽媛、张敏嘉、韩家伟、高建峰。* ICLR 2024。\n\n5. [**PyramidInfer：面向高吞吐量LLM推理的金字塔KV缓存压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.12532) *杨东杰、韩晓东、高燕、胡耀、张士林、赵海。* ACL 2024。[![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmutonix\u002Fpyramidinfer)](https:\u002F\u002Fgithub.com\u002Fmutonix\u002Fpyramidinfer)\n\n6. [**PyramidKV：基于金字塔式信息漏斗的动态KV缓存压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.02069) *蔡泽凡、张一驰、高博飞、刘宇梁、刘天宇、陆克明、熊韦恩、董悦、常宝宝、胡俊杰、肖文。* Arxiv 2024。[![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZefan-Cai\u002FPyramidKV)](https:\u002F\u002Fgithub.com\u002FZefan-Cai\u002FPyramidKV)\n\n7. [**Transformer是多状态RNN。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.06104) *奥伦·马塔内尔、迈克尔·哈西德、尼尔·亚登、约西·阿迪、罗伊·施瓦茨。* Arxiv 2024。[![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fschwartz-lab-NLP\u002FTOVA)](https:\u002F\u002Fgithub.com\u002Fschwartz-lab-NLP\u002FTOVA)\n\n8. [**带有注意力汇流的高效流式语言模型。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17453) *肖光轩、田远东、陈贝蒂、韩松、刘易斯。* ICLR 2024。[![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmit-han-lab\u002Fstreaming-llm)](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fstreaming-llm)\n\n9. [**一种简单而有效的基于L2范数的KV缓存压缩策略。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.11430) *德沃托、赵宇、斯卡达帕内、米内尔维尼。* EMNLP 2024。[![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Falessiodevoto\u002Fl2compress)](https:\u002F\u002Fgithub.com\u002Falessiodevoto\u002Fl2compress)\n\n10. [**检索头从机制上解释了长上下文的事实性。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.15574) *吴文浩、王一忠、肖光轩、彭浩、傅瑶。* Arxiv 2024。[![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fnightdessert\u002FRetrieval_Head)](https:\u002F\u002Fgithub.com\u002Fnightdessert\u002FRetrieval_Head)\n\n11. [**高效的稀疏注意力需要自适应的标记释放。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.02328) *张朝然、邹立新、罗丹、唐敏、罗向阳、李子豪、李晨亮。* ACL 2024。[![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FWHUIR\u002FADORE)](https:\u002F\u002Fgithub.com\u002FWHUIR\u002FADORE)\n\n12. 
[**Loki：用于高效稀疏注意力的低秩密钥。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.02542) *普拉杰瓦尔·辛加尼亚、西达尔特·辛格、施怀·何、索海尔·费齐、阿比纳夫·巴特勒。* Arxiv 2024。\n\n13. [**用更少获得更多：通过KV缓存压缩合成递归以实现高效LLM推理。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.09398) *董哈利、杨心雨、张振宇、王张阳、迟月洁、陈贝蒂。* Arxiv 2024。[![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhdong920\u002FLESS)](https:\u002F\u002Fgithub.com\u002Fhdong920\u002FLESS)\n\n14. [**ALISA：通过稀疏感知的KV缓存加速大型语言模型推理。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.17312) *赵友鹏、吴迪、王军。* ISCA 2024。\n\n15. [**Keyformer：通过关键令牌选择实现KV缓存缩减，以提升生成式推理效率。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.09054) *穆罕默德·阿德南、阿希尔·阿伦库马尔、高拉夫·贾因、普拉桑特·J·奈尔、伊利亚·索洛维伊奇克、普鲁索塔姆·卡马斯。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fd-matrix-ai\u002Fkeyformer-llm)](https:\u002F\u002Fgithub.com\u002Fd-matrix-ai\u002Fkeyformer-llm)\n\n16. [**Ada-KV：通过自适应预算分配优化KV缓存驱逐，以提升LLM推理效率。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.11550) *袁峰、吕俊林、曹宇坤、谢希科、S·凯文·周。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFFY0\u002FAdaKV)](https:\u002F\u002Fgithub.com\u002FFFY0\u002FAdaKV)\n\n17. [**注意力分数并非KV缓存缩减中衡量标记重要性的唯一指标：价值同样重要。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.12335) *郭志宇、上井秀隆、渡边太郎。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fguozhiyu\u002Fvatp)](https:\u002F\u002Fgithub.com\u002Fguozhiyu\u002Fvatp)\n\n18. [**关于键值约束下生成式语言模型推理中驱逐策略的有效性。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.06262) *任思宇、肯尼·Q·朱。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDRSY\u002FEasyKV)](https:\u002F\u002Fgithub.com\u002FDRSY\u002FEasyKV)\n\n19. [**CORM：基于近期消息的缓存优化，用于大语言模型推理。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.15949) *戴金成、黄卓伟、姜海云、陈晨、蔡登、毕伟、史淑明。* Arxiv 2024年。\n    \n20. [**RazorAttention：通过检索头实现高效的KV缓存压缩。**](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2407.15891) *唐汉林、林阳、林静、韩庆森、洪世宽、姚一武、王功义。* Arxiv 2024年。\n\n21. [**A2SF：在Transformer解码器中引入遗忘因子的累积注意力评分机制，用于标记剪枝。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.20485) *赵贤来、申东坤。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDirac-Notation\u002FA2SF)](https:\u002F\u002Fgithub.com\u002FDirac-Notation\u002FA2SF)\n\n22. [**Quest：面向高效长上下文LLM推理的查询感知稀疏化。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.10774) *唐家铭、赵一龙、朱侃、肖广轩、巴里斯·卡西克奇、韩松。* ICML 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmit-han-lab\u002FQuest)](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002FQuest)\n\n23. [**LazyLLM：用于高效长上下文LLM推理的动态标记剪枝。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.14057) *傅启辰、赵敏锡、托马斯·梅尔斯、萨钦·梅赫塔、穆罕默德·拉斯泰加里、马赫亚尔·纳吉比。* Arxiv 2024年。\n\n24. [**NACL：一种通用且高效的LLM推理时KV缓存驱逐框架。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.03675) *陈一龙、王国霞、商俊远、崔诗瑶、张振宇、刘廷文、王硕焕、孙宇、于殿海、吴华。* ACL 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPaddlePaddle\u002FResearch)](https:\u002F\u002Fgithub.com\u002FPaddlePaddle\u002FResearch\u002Ftree\u002Fmaster\u002FNLP\u002FACL2024-NACL)\n\n25. [**训练后双重稀疏化的稀疏注意力。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.07092) *杨硕、盛颖、约瑟夫·E·冈萨雷斯、伊翁·斯托伊卡、郑连民。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fandy-yang-1\u002FDoubleSparse)](https:\u002F\u002Fgithub.com\u002Fandy-yang-1\u002FDoubleSparse)\n\n26. 
[**告别长度外推：有限注意力范围下的无训练无限上下文。**](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2407.15176) *刘晓然、郭启鹏、宋玉荣、刘志耿、吕凯、闫航、李琳琳、刘群、邱希鹏。* Arxiv 2024年。\n\n27. [**动态内存压缩：为LLM加速推理而进行的改造。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.09636) *皮奥特尔·纳夫罗特、阿德里安·兰丘茨基、马尔钦·霍霍夫斯基、大卫·塔尔扬、爱德华多·M·蓬蒂。* ICML 2024年。\n\n28. [**MInference 1.0：通过动态稀疏注意力加速长上下文LLM的预填充。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.02490) *江辉强、李宇成、张成瑞东、吴千慧、罗旭芳、安素仁、韩振华、阿米尔·H·阿卜迪、李东升、林钦佑、杨宇清、邱丽丽。* NeurIPS 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FMInference)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMInference)\n\n29. [**用于高效且可解释的自回归Transformer的动态上下文剪枝。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.15805) *索蒂里斯·阿纳格诺斯提迪斯、达里奥·帕夫洛、卢卡·比吉奥、洛伦佐·诺奇、奥雷利安·卢奇、托马斯·霍夫曼。* NeurIPS 2023年。\n\n30. [**RetrievalAttention：通过向量检索加速长上下文LLM推理。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.10516) *刘迪、陈萌、陆宝彤、江辉强、韩振华、张千曦、陈琪、张成瑞东、丁白露、张凯、陈晨、杨凡、杨宇清、邱丽丽。* Arxiv 2024年。\n\n31. [**Sirius：用于高效LLM的上下文稀疏化与修正。**](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2409.03856) *周洋、陈卓明、徐兆卓、林维多利亚、陈蓓蒂。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Finfini-ai-lab\u002Fsirius)](https:\u002F\u002Fgithub.com\u002Finfini-ai-lab\u002Fsirius)\n\n32. [**Inf-MLLM：单GPU上多模态大语言模型的高效流式推理。**](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2409.09086) *宁振宇、赵洁茹、金启浩、丁文超、郭敏怡。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Finfly-ai\u002FINF-MLLM)](https:\u002F\u002Fgithub.com\u002Finfly-ai\u002FINF-MLLM)\n\n33. [**大语言模型中的无训练激活稀疏化。**](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2408.14690) *詹姆斯·刘、普拉加什·波努萨米、蔡天乐、郭汉、金允、本·阿提瓦拉特昆。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFasterDecoding\u002FTEAL)](https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FTEAL)\n\n34. [**KVPruner：用于更快速、更省存的大语言模型结构化剪枝。**](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2409.11057) *吕博、周权、丁宣昂、王岩、马泽明。* Arxiv 2024年。\n\n35. [**CritiPrefill：基于分段关键性评估的LLM预填充加速方法。**](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2409.12490) *吕俊林、袁峰、谢希科、贾欣、彭启荣、谢贵明。* Arxiv 2024年。\n\n36. [**发掘早期层中的精华：通过将输入标记减少1000倍加速长上下文LLM。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.17422) *石珍美、明一飞、阮宣菲、梁英宇、乔蒂·沙菲克。* Arxiv 2024年。\n\n37. [**KV-Compress：按每个注意力头采用可变压缩率的分页KV缓存压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.00161) *艾萨克·雷格。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIsaacRe\u002Fvllm-kvcompress)](https:\u002F\u002Fgithub.com\u002FIsaacRe\u002Fvllm-kvcompress)\n\n38. [**InfiniPot：在内存受限的LLM上实现无限上下文处理。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01518) *金敏洙、沈九鸿、崔正旭、昌思明。* EMNLP 2024年。\n\n39. [**Locret：通过训练好的保留头增强长上下文大语言模型推理中的逐出机制。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01805) *黄宇翔、袁斌航、韩旭、肖朝军、刘知远。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhuangyuxiang03\u002FLocret)](https:\u002F\u002Fgithub.com\u002Fhuangyuxiang03\u002FLocret)\n\n40. [**SparseVLM：用于高效视觉-语言模型推理的视觉标记稀疏化。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.04417) *张元、范春凯、马俊鹏、郑文钊、黄涛、程宽、丹尼斯·古多夫斯基、奥久野智之、中田洋平、库尔特·科伊策、张尚航。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGumpest\u002FSparseVLMs)](https:\u002F\u002Fgithub.com\u002FGumpest\u002FSparseVLMs)\n\n41. 
[**LoCoCo：在卷积中引入丢弃以实现长上下文压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.05317) *蔡瑞思、田远东、王章阳、陈蓓迪。* ICML 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-Group\u002FLoCoCo)](https:\u002F\u002Fgithub.com\u002FVITA-Group\u002FLoCoCo)\n\n42. [**DuoAttention：通过检索与流式头实现高效的长上下文大语言模型推理。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10819) *肖广轩、唐嘉明、左静薇、郭俊贤、杨尚、唐浩天、傅瑶、韩松。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmit-han-lab\u002Fduo-attention)](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fduo-attention)\n\n43. [**SimLayerKV：一种用于层级KV缓存缩减的简单框架。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13846) *张璇、杜存孝、杜超、庞天宇、高伟、林敏。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsail-sg\u002FSimLayerKV)](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002FSimLayerKV)\n\n44. [**基于注意力门控的LLM上下文内KV缓存逐出。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12876) *曾子豪、林博凯、侯天琪、张浩、邓志杰。* Arxiv 2024年。\n\n45. [**CacheGen：用于快速大型语言模型服务的KV缓存压缩与流式处理。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.07240) *刘宇涵、李汉臣、程一华、西达特·雷、黄宇扬、张启正、杜坤泰、姚佳怡、陆珊、加内什·阿南塔纳拉亚南、迈克尔·梅尔、亨利·霍夫曼、阿里·霍尔茨曼、江俊辰。* ACM SIGCOMM 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLMCache\u002FLMCache)](https:\u002F\u002Fgithub.com\u002FLMCache\u002FLMCache)\n\n46. [**MagicPIG：用于高效LLM生成的LSH采样。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16179) *陈卓明、萨杜坎·拉纳乔伊、叶子豪、周洋、张建宇、尼克拉斯·诺尔特、田远东、马蒂伊斯·杜泽、莱昂·博托、贾志浩、陈蓓迪。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInfini-AI-Lab\u002FMagicPIG)](https:\u002F\u002Fgithub.com\u002FInfini-AI-Lab\u002FMagicPIG)\n\n47. [**TidalDecode：利用位置持久化稀疏注意力实现快速且准确的LLM解码。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05076) *杨立杰、张志浩、陈卓夫、李子坤、贾志浩。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDerrickYLJ\u002FTidalDecode)](https:\u002F\u002Fgithub.com\u002FDerrickYLJ\u002FTidalDecode)\n\n48. [**ShadowKV：用于高吞吐量长上下文LLM推理的影子KV缓存。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.21465) *孙汉石、常丽雯、鲍文蕾、郑思泽、郑宁欣、刘欣、董哈里、迟月洁、陈蓓迪。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002FShadowKV)](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FShadowKV)\n\n49. [**BUZZ：蜂巢结构的稀疏KV缓存结合分段重头，用于高效LLM推理。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.23079) *赵俊奇、方志金、李树、杨少辉、何世超。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FJunqiZhao888\u002Fbuzz-llm)](https:\u002F\u002Fgithub.com\u002FJunqiZhao888\u002Fbuzz-llm)\n\n50. [**CItruS：针对长序列建模的分块指令感知状态逐出机制。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.12018) *白宇、邹希源、黄赫燕、陈三兴、马克-安托万·隆多、高阳、张杰基。* EMNLP 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fybai-nlp\u002FCItruS)](https:\u002F\u002Fgithub.com\u002Fybai-nlp\u002FCItruS)\n\n51. [**TokenSelect：通过动态的标记级KV缓存选择实现LLM的高效长上下文推理与长度外推。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02886) *吴伟、潘卓石、王超、陈立义、白云初、傅坤、王政、熊辉。* Arxiv 2024年。\n\n52. [**Recycled Attention：用于长上下文语言模型的高效推理。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.05787) *徐芳媛、戈雅尔·坦娅、崔恩瑟尔。* Arxiv 2024年。\n\n53. [**VL-Cache：用于加速视觉-语言模型推理的稀疏与模态感知KV缓存压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.23317) *涂德展、瓦什切连科·达尼洛、卢宇哲、徐盼盼。* Arxiv 2024年。\n\n54. [**Squeezed Attention：加速长上下文LLM推理。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.09688) *胡珀·科尔曼、金世勋、穆罕默德扎德·希瓦、马赫斯瓦兰·莫尼什瓦兰、朴俊、迈克尔·W·马霍尼、库尔特·科伊策、阿米尔·戈拉米。* Arxiv 2024年。\n\n55. 
[**ArkVale：通过可召回的键值逐出实现高效的生成式LLM推理。**](https:\u002F\u002Fgithub.com\u002Fpku-liang\u002FArkVale\u002Fblob\u002Fmain\u002Fmedia\u002Farkvale-nips24-paper.pdf) *陈仁泽、王卓峰、曹蓓泉、吴通、郑思泽、李秀红、魏学超、严生恩、李萌、梁云。* NeurIPS 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpku-liang\u002FArkVale)](https:\u002F\u002Fgithub.com\u002Fpku-liang\u002FArkVale)\n\n56. [**并非所有头都重要：一种集成检索与推理的头级KV缓存压缩方法。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19258) *傅瑶、蔡泽凡、阿贝德尔卡迪尔·阿西、韦恩·雄、董岳、肖文。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFYYFU\u002FHeadKV)](https:\u002F\u002Fgithub.com\u002FFYYFU\u002FHeadKV)\n\n57. [**[CLS] 注意力就是免训练视觉标记修剪所需的一切：让VLM推理更快。**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.01818) *张启哲、程傲松、陆明、卓志勇、王敏琦、曹家俊、郭绍波、佘奇、张尚航。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTheia-4869\u002FFasterVLM)](https:\u002F\u002Fgithub.com\u002FTheia-4869\u002FFasterVLM)\n\n58. [**Fit and Prune：用于多模态大型语言模型的快速且无需训练的视觉标记修剪。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.10197) *叶伟浩、吴琼、林文浩、周依依。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fywh187\u002FFitPrune)](https:\u002F\u002Fgithub.com\u002Fywh187\u002FFitPrune)\n\n59. [**ClusterKV：在语义空间中操控LLM KV缓存以实现可召回的压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.03213) *刘广达、李成伟、赵洁茹、张晨琪、郭敏仪。* Arxiv 2024年。\n\n60. [**用LeanKV统一大型语言模型的KV缓存压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.03131) *张艳琪、胡宇威、赵润元、约翰·C.S.刘、陈海波。* Arxiv 2024年。\n\n61. [**DynamicKV：面向长上下文大模型的任务感知自适应KV缓存压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.14838) *周侠斌、王文彬、曾敏燕、郭家贤、刘学波、沈立、张敏、丁亮。* Arxiv 2024年。\n\n62. [**SCOPE：优化长上下文生成中的键值缓存压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.13649) *吴家龙、王正林、张林海、赖一龙、何玉兰、周德宇。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLinking-ai\u002FSCOPE)](https:\u002F\u002Fgithub.com\u002FLinking-ai\u002FSCOPE)\n\n63. [**HashEvict：一种基于局部敏感哈希的预注意力KV缓存驱逐策略。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.16187) *刘明辉、塔赫辛·拉巴尼、托尼·奥哈洛伦、阿南特·桑卡林加姆、玛丽-安妮·哈特利、布赖恩·格拉维尔、黄福荣、科妮莉亚·费尔穆勒、伊安尼斯·阿洛伊莫诺斯。* Arxiv 2024年。\n\n64. [**SepLLM：通过将一个片段压缩为一个分隔符来加速大语言模型。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.12094) *陈国轩、史翰、李嘉伟、高逸航、任晓哲、陈怡萌、姜鑫、李振国、刘伟洋、黄超。* Arxiv 2024年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHKUDS\u002FSepLLM)](https:\u002F\u002Fgithub.com\u002FHKUDS\u002FSepLLM)\n\n65. [**MiniKV：通过2比特层判别式KV缓存突破大语言模型推理极限。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18077) *阿克沙特·夏尔马、丁航梁、李建平、尼尔·达尼、张敏佳。* Arxiv 2025年。\n\n66. [**FastKV：基于令牌选择性传播的快速长上下文处理KV缓存压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.01068) *赵东元、宋智媛、金玉花、金在俊。* Arxiv 2025年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdongwonjo\u002FFastKV)](https:\u002F\u002Fgithub.com\u002Fdongwonjo\u002FFastKV)\n\n67. [**ChunkKV：语义保持的KV缓存压缩，用于高效长上下文大语言模型推理。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.00299) *刘翔、唐振恒、董培杰、李泽宇、李博、胡旭明、楚晓雯。* Arxiv 2025年。\n\n68. [**LServe：基于统一稀疏注意力的高效长序列大语言模型服务。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14866) *杨尚、郭俊贤、唐浩天、胡庆浩、肖光轩、唐家铭、林宇军、刘志坚、陆瑶、韩松。* MLSys 2025年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmit-han-lab\u002Fomniserve)](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fomniserve)\n\n69. [**RocketKV：通过两阶段KV缓存压缩加速长上下文大语言模型推理。**](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2502.14051) *贝赫纳姆·帕伊曼、傅耀生、里奇·赵、蔡博安、于志鼎、阿列克谢·图马诺夫。* Arxiv 2025年。\n\n70. 
[**Dynamic-LLaVA：通过动态视觉—语言上下文稀疏化实现高效的多模态大语言模型。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.00876) *黄文轩、翟子杰、沈云航、曹少胜、赵飞、徐向峰、叶哲宇、林绍辉。* ICLR 2025年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOsilly\u002Fdynamic_llava)](https:\u002F\u002Fgithub.com\u002FOsilly\u002Fdynamic_llava)\n\n71. [**DBudgetKV：在KV缓存压缩中动态预算以确保最佳性能。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.16886) *倪宣凡、徐丽艳、吕晨阳、王龙悦、余墨、刘乐茂、孟凡东、周洁、李皮吉。* Arxiv 2025年。\n\n72. [**无限制对话：用于大语言模型扩展回复的恒定大小KV缓存。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.00979) *拉维·加迪亚、阿维纳什·库马尔、高拉夫·贾因、普拉桑特·奈尔、保拉米·达斯。* ICML 2025年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fghadiaravi13\u002FMorphKV)](https:\u002F\u002Fgithub.com\u002Fghadiaravi13\u002FMorphKV)\n\n73. [**MEDA：用于高效多模态长上下文推理的动态KV缓存分配。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17599) *万中伟、沈慧、王欣、刘彻、麦哲达、张密。* NAACL 2025年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIoT-MLSys-Lab\u002FMEDA)](https:\u002F\u002Fgithub.com\u002FAIoT-MLSys-Lab\u002FMEDA)\n\n74. [**KVCrush：利用头部行为相似性进行键值缓存尺寸缩减。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.00022) *戈皮·克里希纳·贾、萨梅赫·戈布里埃尔、柳博芙·塔拉马诺娃、亚历山大·科兹洛夫、尼莱什·贾因。* Arxiv 2025年。\n\n75. [**大语言模型知道该舍弃什么：自注意力引导的KV缓存驱逐以实现高效长上下文推理。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.08879) *王广涛、舒邦吉·乌帕萨尼、吴辰、达尔尚·甘地、乔纳森·李、胡昌然、李博、乌尔米什·塔克尔。* ICLR 2025年。\n\n76. [**SpeCache：用于大语言模型高效生成的推测性键值缓存。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.16163) *解世博、唐业辉、韩凯、邓志宏、韩静。* Arxiv 2025年。\n\n77. [**KV-Distill：近乎无损可学习的大语言模型上下文压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.10337) *维韦克·查里、秦广辉、本杰明·范·杜尔姆。* Arxiv 2025年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvnchari\u002Fkv-distill)](https:\u002F\u002Fgithub.com\u002Fvnchari\u002Fkv-distill)\n\n78. [**SentenceKV：通过句子级语义KV缓存实现高效大语言模型推理。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.00970) *朱宇轩、法拉哈蒂·阿里、杨大卫、穆罕默德·穆罕默迪·阿米里。* Arxiv 2025年。\n\n79. [**稀疏前沿：Transformer大语言模型中的稀疏注意力权衡。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.17768) *皮奥特尔·纳夫罗特、罗伯特·李、黄仁杰、塞巴斯蒂安·鲁德尔、凯莉·马尔基西奥、爱德华多·M·蓬蒂。* Arxiv 2025年。\n\n80. [**FreqKV：用于高效扩展上下文窗口的频域键值压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.00570) *凯居士、曾博艺、王一轩、白浩利、江波、林周涵。* Arxiv 2025年。\n\n81. [**Mustafar：促进非结构化稀疏性以用于大语言模型推理中的KV缓存修剪。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22913) *朱东贤、霍赛莉·侯赛尼、拉米亚德·哈迪迪、巴哈尔·阿斯加里。* Arxiv 2025年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdhjoo98\u002Fmustafar)](https:\u002F\u002Fgithub.com\u002Fdhjoo98\u002Fmustafar)\n    \n82. [**CAKE：基于层级偏好的级联与自适应KV缓存驱逐。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.12491) *秦子然、曹雨晨、林明宝、胡文、范世轩、程科、林伟尧、李建国。* ICLR 2025年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fantgroup\u002Fcakekv)](https:\u002F\u002Fgithub.com\u002Fantgroup\u002Fcakekv)\n\n83. [**R-KV：面向无需训练的推理模型加速的冗余感知KV缓存压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.24133) *蔡泽凡、肖文、孙汉诗、罗成、张易凯、万科、李宇成、周业扬、常立文、顾久祥、董震、阿尼玛·阿南德库马尔、阿贝德尔卡迪尔·阿西、胡俊杰。* Arxiv 2025年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZefan-Cai\u002FR-KV)](https:\u002F\u002Fgithub.com\u002FZefan-Cai\u002FR-KV)\n\n84. [**KVzip：不依赖查询的上下文重建KV缓存压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23416) *金章贤、金珍淑、权相佑、李在W、尹相斗、宋贤欧。* Arxiv 2025年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsnu-mllab\u002FKVzip)](https:\u002F\u002Fgithub.com\u002Fsnu-mllab\u002FKVzip)\n\n85. 
[**同质密钥、异质值：利用局部KV缓存的不对称性提升长上下文LLM的推理效率。**](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2506.05410) *崔万云、徐明伟.* Arxiv 2025年。\n\n86. [**AhaKV：自适应的整体注意力驱动KV缓存驱逐策略，用于高效推理大型语言模型。**](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2506.03762) *顾一峰、蒋子聪、金建秀、郭凯玲、张子阳、徐向敏.* Arxiv 2025年。\n\n87. [**基于KV缓存压缩的推理时超尺度扩展。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05345) *阿德里安·兰丘茨基、孔拉德·斯塔尼舍夫斯基、皮奥特尔·纳夫罗特、爱德华多·M·蓬蒂.* Arxiv 2025年。\n\n88. [**InfiniPot-V：面向流式视频理解的内存受限KV缓存压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.15745v1) *金民洙、沈奎洪、崔正旭、张思明.* Arxiv 2025年。\n\n89. [**“Cache Me If You Can”：有效长上下文LLM需要多少个KV？**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.17121) *阿迪提亚·巴斯卡尔、亚历山大·韦蒂格、高天宇、董艺赫、陈丹琪.* Arxiv 2025年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fprinceton-pli\u002FPruLong)](https:\u002F\u002Fgithub.com\u002Fprinceton-pli\u002FPruLong)\n\n90. [**Compactor：基于近似杠杆分数的校准型查询无关KV缓存压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.08143) *维韦克·查里、本杰明·范杜尔姆.* Arxiv 2025年。\n\n91. [**LaCache：用于高效长上下文建模的梯级式KV缓存策略，适用于大型语言模型。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.14204) *史大川、傅永干、袁向驰、于中志、游浩然、李思旭、董欣、扬·考茨、帕夫洛·莫尔恰诺夫、林英燕（塞琳）.* ICML 2025年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGATECH-EIC\u002FLaCache)](https:\u002F\u002Fgithub.com\u002FGATECH-EIC\u002FLaCache)\n\n92. [**EvolKV：用于LLM推理的进化式KV缓存压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.08315) *余博涵、柴业坤.* Arxiv 2025年。\n\n93. [**LAVa：基于动态预算分配的分层KV缓存驱逐策略。**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.09754) *沈怡群、袁松、张正泽、王小亮、姜大鑫、阮甘图.* Arxiv 2025年。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMGDDestiny\u002FLava)](https:\u002F\u002Fgithub.com\u002FMGDDestiny\u002FLava)\n\n94. [**KeyDiff：基于密钥相似性的KV缓存驱逐策略，适用于资源受限环境下的长上下文LLM推理。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.15364) *朴俊荣、达尔顿·琼斯、马修·J·莫尔斯、拉加夫·戈埃尔、李民九、克里斯·洛特.* NeurIPS 2025年。\n\n95. [**CAOTE：基于注意力输出误差的标记驱逐策略，用于LLM的KV缓存选择。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.14051) *拉加夫·戈埃尔、朴俊荣、穆库尔·加格拉尼、达尔顿·琼斯、马修·莫尔斯、哈珀·兰斯顿、李民九、克里斯·洛特.* Arxiv 2025年。\n\n### 3️⃣ 跨层\n\n1. [**仅缓存一次：面向语言模型的解码器—解码器架构。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.05254) *孙宇涛、董立、朱毅、黄少涵、王文辉、马树明、张全路、王建勇、魏福如。* NeurIPS 2024。[![GitHub 仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Funilm)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm\u002Ftree\u002Fmaster\u002FYOCO)\n   \n2. [**利用跨层注意力减少 Transformer 的键值缓存大小。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.12981) *威廉·布兰登、马扬克·米什拉、阿尼鲁达·恩鲁西姆哈、拉梅斯瓦尔·潘达、乔纳森·拉根·凯利。* Arxiv 2024。\n   \n3. [**用于高效推理大型语言模型的层压缩键值缓存。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.10637) *吴浩义、涂可伟。* ACL 2024。[![GitHub 仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FwhyNLP\u002FLCKV)](https:\u002F\u002Fgithub.com\u002FwhyNLP\u002FLCKV)\n\n4. [**MiniCache：针对大型语言模型的深度维度键值缓存压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.14366) *刘阿基德、刘晶、潘子正、何业飞、戈拉姆雷扎·哈法里、庄博涵。* Arxiv 2024。[![GitHub 仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAkideLiu\u002FMiniCache)](https:\u002F\u002Fgithub.com\u002FAkideLiu\u002FMiniCache)\n\n5. [**MLKV：用于内存高效 Transformer 解码的多层键值头。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.09297) *扎伊德·穆罕默德·卡瓦基比·祖赫里、穆罕默德·法里德·阿迪拉祖尔达、阿尤·普尔维安蒂、阿尔哈姆·菲克里·阿吉。* Arxiv 2024。[![GitHub 仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzaydzuhri\u002Fpythia-mlkv)](https:\u002F\u002Fgithub.com\u002Fzaydzuhri\u002Fpythia-mlkv)\n\n6. 
[**关于跨层键值共享以实现高效 LLM 推理的系统性研究。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.14442) *吴优、吴浩义、涂可伟。* Arxiv 2024。\n\n7. [**KVSharer：通过逐层异构键值缓存共享实现高效推理。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18517) *杨一飞、曹邹英、陈奇光、秦立波、杨东杰、赵海、陈志。* Arxiv 2024。[![GitHub 仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyangyifei729\u002FKVSharer)](https:\u002F\u002Fgithub.com\u002Fyangyifei729\u002FKVSharer)\n\n8. [**SwiftKV：基于知识保持的模型转换实现快速预填充优化推理。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.03960) *奥里克·乔、姚哲伟、萨米亚姆·拉吉班达里、何宇雄。* Arxiv 2024。[![GitHub 仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsnowflakedb\u002Farctictraining)](https:\u002F\u002Fgithub.com\u002Fsnowflakedb\u002Farctictraining)\n\n9. [**利用层间注意力相似性压缩键值缓存以支持长上下文 LLM 推理。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.02252) *马达、陈陆、张思拓、苗宇迅、朱苏、陈志、徐洪深、李汉琪、范帅、潘磊、于凯。* Arxiv 2024。 \n\n### 4️⃣ 低秩\n\n1. [**快速 Transformer 解码：只需一个写头即可。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1911.02150) *诺姆·沙泽尔。* Arxiv 2019。\n   \n2. [**GQA：从多头检查点训练通用多查询 Transformer 模型。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.13245) *乔舒亚·艾恩斯利、詹姆斯·李-索普、米希尔·德·容、尤里·泽姆良斯基、费德里科·莱布隆、苏米特·桑盖。* EMNLP 2023。\n\n3. [**DeepSeek-V2：一款强大、经济且高效的专家混合语言模型。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.04434) *DeepSeek-AI。* Arxiv 2024。[![GitHub 仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-V2)](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-V2)\n\n4. [**有效压缩 LLM 的键值头。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07056) *于浩、杨泽兰、李申、李勇、吴建新。* Arxiv 2024。\n\n5. [**Palu：利用低秩投影压缩键值缓存。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.21118) *张志智、王旷、刘丽媛、王硕航、程浩、张超、沈叶龙。* Arxiv 2024。[![GitHub 仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshadowpa0327\u002FPalu)](https:\u002F\u002Fgithub.com\u002Fshadowpa0327\u002FPalu)\n\n6. [**LoRC：采用渐进式压缩策略的 LLM 键值缓存低秩压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.03111) *张荣志、王旷、刘丽媛、王硕航、程浩、张超、沈叶龙。* Arxiv 2024。\n\n7. [**张量积注意力就是你需要的一切。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.06425) *张一凡、刘一峰、袁慧卓、秦真、袁阳、顾全全、姚期智。* Arxiv 2025。[![GitHub 仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftensorgi\u002FT6)](https:\u002F\u002Fgithub.com\u002Ftensorgi\u002FT6)\n\n8. [**ThinK：通过查询驱动的剪枝实现更薄的键值缓存。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.21018) *徐宇辉、揭展明、董汉泽、王磊、卢旭东、周傲俊、萨哈·阿姆丽塔、熊才明、萨胡·多延。* Arxiv 2024。[![GitHub 仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSalesforceAIResearch\u002FThinK)](https:\u002F\u002Fgithub.com\u002FSalesforceAIResearch\u002FThinK)\n\n9. [**超越同质注意力：通过傅里叶近似键值缓存实现内存高效的 LLM。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.11886) *刘晓然、何思洋、王奇奇、李瑞晓、宋玉蓉、刘志耿、李琳琳、刘群、黄增峰、郭启鹏、何子威、邱希鹏。* Arxiv 2025。\n\n10. [**OjaKV：基于 Oja 规则的上下文感知在线低秩键值缓存压缩。**](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2509.21623) *朱宇轩、杨大卫、穆罕默德·莫哈马迪·阿米里、基尔提拉姆·穆鲁格桑、特贾斯维尼·佩达帕蒂、陈品宇。* Arxiv 2025。\n\n### 5️⃣ 量化\n\n1. [**ZipCache：基于显著标记识别的精准高效KV缓存量化。**](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2405.14256) *何叶飞、张洛明、吴伟佳、刘静、周红、庄博涵.* Arxiv 2024。\n\n2. [**一个标记也不落下：基于重要性感知的混合精度量化实现可靠的KV缓存压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.18096) *杨俊勇、金炳旭、裴正仁、权范锡、朴根浩、杨恩浩、权世荣、李东洙.* Arxiv 2024。\n\n3. [**KIVI：无需调优的KV缓存非对称2位量化。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.02750) *刘子睿、袁嘉怡、金洪烨、钟绍辰、徐兆卓、弗拉基米尔·布拉韦尔曼、陈蓓迪、胡霞.* ICML 2024。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjy-yuan\u002FKIVI)](https:\u002F\u002Fgithub.com\u002Fjy-yuan\u002FKIVI)\n\n4. 
[**GEAR：面向大语言模型生成式推理的近无损KV缓存高效压缩方案。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.05527) *康浩、张庆如、苏维克·昆杜、郑建华、刘兆兴、图沙尔·克里希纳、赵拓.* Arxiv 2024。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopengear-project\u002FGEAR)](https:\u002F\u002Fgithub.com\u002Fopengear-project\u002FGEAR)\n\n5. [**PQCache：基于乘积量化的大规模上下文LLM推理KV缓存。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.12820) *张海林、季晓东、陈一琳、傅方成、苗旭鹏、聂晓楠、陈伟鹏、崔斌.* Arxiv 2024。\n\n6. [**利用矩阵分解解锁无数据低比特量化，用于KV缓存压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.12591) *刘培宇、高泽峰、赵鑫、马一鹏、王涛、温继荣.* Arxiv 2024。\n\n7. [**SKVQ：面向大语言模型的滑动窗口键值缓存量化。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.06219) *端木浩杰、袁志航、李秀红、段江飞、张兴成、林大华.* Arxiv 2024。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcat538\u002FSKVQ)](https:\u002F\u002Fgithub.com\u002Fcat538\u002FSKVQ)\n\n8. [**QAQ：面向LLM KV缓存的质量自适应量化。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.04643) *董诗晨、程文、秦家玉、王伟.* Arxiv 2024。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FClubieDong\u002FQAQ-KVCacheQuantization)](https:\u002F\u002Fgithub.com\u002FClubieDong\u002FQAQ-KVCacheQuantization)\n\n9. [**KVQuant：迈向1000万上下文长度的LLM推理，借助KV缓存量化。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.18079) *霍珀·科尔曼、金世勋、穆罕默德扎德·希瓦、迈克尔·W·马霍尼、邵雅坤·索菲娅、库尔特·凯茨尔、阿米尔·戈拉米.* NeurIPS 2024。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSqueezeAILab\u002FKVQuant)](https:\u002F\u002Fgithub.com\u002FSqueezeAILab\u002FKVQuant)\n\n10. [**WKVQuant：为大语言模型量身定制的权重与键值缓存量化，效果更佳。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.12065) *岳雨轩、袁志航、端木浩杰、周思凡、吴建龙、聂立强.* Arxiv 2024。\n\n11. [**KVTuner：基于敏感度感知的逐层混合精度KV缓存量化，实现高效且近乎无损的LLM推理。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.04420) *李星、邢泽宇、李一鸣、曲林平、甄慧玲、刘武龙、姚一武、潘信诺·贾林、袁明轩.* ICML 2025。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcmd2001\u002FKVTuner)](https:\u002F\u002Fgithub.com\u002Fcmd2001\u002FKVTuner)\n\n12. [**即插即用的1.x位KV缓存量化，适用于视频大语言模型。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.16257) *陶科达、游浩轩、隋阳、秦灿、王欢.* Arxiv 2025。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FKD-TAO\u002FVidKV)](https:\u002F\u002Fgithub.com\u002FKD-TAO\u002FVidKV)\n\n13. [**BitDecoding：解锁张量核心，以低比特KV缓存支持长上下文LLM解码。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.18773) *杜大友、曹士杰、程建义、曹婷、杨茂.* Arxiv 2025。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDD-DuDa\u002FBitDecoding)](https:\u002F\u002Fgithub.com\u002FDD-DuDa\u002FBitDecoding)\n\n14. [**TaDA：无需训练的自适应KV缓存压缩与均值中心化解码方案。**](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2506.04642) *乔希·维奈、布拉赫玛·普拉蒂克·普拉班詹、刘子诚、埃马德·巴尔苏姆。* ACL 2025产业赛道。\n\n15. [**TailorKV：通过定制化KV缓存优化实现长上下文推理的混合框架。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19586) *姚鼎宇、沈博文、林正、刘伟、栾健、王斌、王伟平.* ACL 2025。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fydyhello\u002FTailorKV)](https:\u002F\u002Fgithub.com\u002Fydyhello\u002FTailorKV)\n\n16. [**CommVQ：用于KV缓存压缩的交换向量量化。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.18879) *李俊燕、张洋、穆罕默德·尤素夫·哈桑、塔尔哈·查费卡尔、蔡天乐、任志乐、郭鹏升、福鲁赞·卡里姆扎德、科罗拉多·里德、王冲、甘创.* ICML 2025。\n\n\n\n### 6️⃣ 提示压缩\n\n1. [**LLMLingua：压缩提示以加速大语言模型推理。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.05736) *姜辉强、吴千慧、林钦佑、杨玉清、邱莉莉.* EMNLP 2023。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FLLMLingua)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLMLingua)\n\n2. 
[**LLMLingua-2：数据蒸馏实现高效且忠实的任务无关提示压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.12968) *潘卓石、吴千慧、姜辉强、夏梦琳、罗旭芳、张珏、林青伟、维克托·吕勒、杨玉清、林钦佑、赵H·维姬、邱莉莉、张冬梅.* ACL 2024。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FLLMLingua)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLMLingua)\n\n3. [**LongLLMLingua：通过提示压缩加速并提升大语言模型在长上下文场景中的表现。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06839) *姜辉强、吴千慧、罗旭芳、李东生、林钦佑、杨玉清、邱莉莉.* ACL 2024。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FLLMLingua)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLMLingua)\n\n4. [**TACO-RL：基于强化学习的任务感知提示压缩优化。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.13035) *希瓦姆·尚迪利亚、夏梦琳、苏普里约·戈什、姜辉强、张珏、吴千慧、维克托·吕勒.* Arxiv 2024。\n\n5. [**ICPC：上下文内提示压缩，加速推理。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.01625) *于子洋、刘宇宇.* Arxiv 2025。\n\n6. [**无需多层感知机的更好提示压缩。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.06730) *爱德华多·霍尼格、安德鲁·利萨拉加、张子君·弗兰克、吴颖年.* Arxiv 2025。\n\n\n### 7️⃣ 复用\n\n1. [**KVLink：通过高效KV缓存复用加速大语言模型。**](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2502.16002) *杨景波、侯百儒、魏巍、鲍宇佳、常诗雨.* Arxiv 2025。[![GitHub仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FUCSB-NLP-Chang\u002FKVLink)](https:\u002F\u002Fgithub.com\u002FUCSB-NLP-Chang\u002FKVLink)\n\n### 8️⃣ 非自回归\n\n1. [**dKV-Cache：扩散语言模型的缓存机制。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15781) *马欣欣、于润鹏、方功凡、王新超。* Arxiv 2025年。[![GitHub 仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhorseee\u002FdKV-Cache)](https:\u002F\u002Fgithub.com\u002Fhorseee\u002FdKV-Cache)\n\n2. [**dLLM-Cache：通过自适应缓存加速扩散大型语言模型。**](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2506.06295) *刘志远、杨一村、张耀杰、陈俊杰、邹畅、魏青岩、王少波、张林峰。* Arxiv 2025年。[![GitHub 仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmaomaocun\u002FdLLM-cache)](https:\u002F\u002Fgithub.com\u002Fmaomaocun\u002FdLLM-cache)\n\n3. [**Fast-dLLM：通过启用KV缓存与并行解码实现无需训练的扩散型LLM加速。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22618) *吴成悦、张浩、薛淑晨、刘志坚、刁世哲、朱立根、罗平、韩松、谢恩泽。* Arxiv 2025年。[![GitHub 仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FFast-dLLM)](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FFast-dLLM)\n\n\n## 📊 评估\n\n1. [**KV缓存压缩，但我们需要付出什么代价？长上下文能力方法的全面基准测试。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.01527) *袁嘉怡、刘鸿毅、钟绍辰（Henry）、庄宇能、李松晨、王冠楚、黎杜、金洪烨、维平·乔杜里、徐兆卓、刘子睿、胡霞。* EMNLP 2024年。[![GitHub 仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhenryzhongsc\u002Flongctx_bench)](https:\u002F\u002Fgithub.com\u002Fhenryzhongsc\u002Flongctx_bench)\n\n2. [**SCBench：以KV缓存为中心的长上下文方法分析。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.10319) *李宇成、江慧强、吴千慧、罗旭芳、安叙仁、张成瑞东、阿米尔·H·阿卜迪、李东升、高建峰、杨玉清、邱莉莉。* Arxiv 2024年。[![GitHub 仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FMInference)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMInference\u002Ftree\u002Fmain\u002Fscbench)\n\n3. [**更多token，更低精度：迈向KV缓存压缩中的最优token-精度权衡。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.12706) *张洁斌、朱大伟、宋一帆、吴文浩、匡楚桥、李晓光、尚立峰、刘群、李素建。* Arxiv 2024年。[![GitHub 仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzhzihao\u002FQPruningKV)](https:\u002F\u002Fgithub.com\u002Fzhzihao\u002FQPruningKV)\n\n4. 
[**重新思考用于大型语言模型服务的键值缓存压缩技术。**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.24000) *高伟、周欣宇、孙鹏、张天威、温永刚。* Arxiv 2025年。[![GitHub 仓库星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLLMkvsys\u002Frethink-kv-compression)](https:\u002F\u002Fgithub.com\u002FLLMkvsys\u002Frethink-kv-compression)","# Awesome-KV-Cache-Compression 快速上手指南\n\n> 一站式 KV-Cache 压缩论文、代码与工具导航仓库，持续更新中。\n\n---\n\n## 环境准备\n- **系统**：Linux \u002F macOS \u002F Windows WSL2  \n- **Python**：≥ 3.8  \n- **PyTorch**：≥ 2.0（CUDA 11.8+ 推荐）  \n- **GPU**：NVIDIA GPU（≥ 8 GB 显存，Ampere 架构以上最佳）  \n- **网络**：可访问 GitHub \u002F Hugging Face（国内建议配置镜像）\n\n---\n\n## 安装步骤\n\n1. 克隆仓库  \n   ```bash\n   git clone https:\u002F\u002Fgithub.com\u002FOctober2001\u002FAwesome-KV-Cache-Compression.git\n   cd Awesome-KV-Cache-Compression\n   ```\n\n2. 安装核心依赖（以 kvpress 为例）  \n   ```bash\n   # 创建虚拟环境\n   python -m venv kv_env\n   source kv_env\u002Fbin\u002Factivate  # Windows: kv_env\\Scripts\\activate\n\n   # 安装 kvpress（含 transformers）\n   pip install -U pip\n   pip install kvpress\n   ```\n\n3. 国内加速（可选）  \n   ```bash\n   # 清华镜像\n   pip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple kvpress\n   # Hugging Face 镜像\n   export HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com\n   ```\n\n---\n\n## 基本使用\n\n### 1. 快速体验 kvpress（几行代码压缩 KV-Cache）\n```python\nfrom kvpress import SnapKVPress\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\n# kvpress 面向 Llama\u002FMistral\u002FQwen 等主流解码器架构，此处以小模型为例\nmodel_name = \"Qwen\u002FQwen2.5-0.5B-Instruct\"\ntokenizer = AutoTokenizer.from_pretrained(model_name)\nmodel = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=\"auto\", device_map=\"auto\")\n\n# 使用 SnapKV 在预填充阶段丢弃 20% 的 KV-Cache\npress = SnapKVPress(compression_ratio=0.2)\n\ninputs = tokenizer(\"你好，世界\", return_tensors=\"pt\").to(model.device)\n# press(model) 返回上下文管理器，在其作用域内注册压缩钩子，随后正常推理即可\nwith press(model):\n    out = model.generate(**inputs, max_new_tokens=20)\nprint(tokenizer.decode(out[0]))\n```\n\n
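在动手跑批量测试前，可以先粗略估算基线 KV-Cache 的显存占用，判断需要多大的压缩率。下面是一个按通用公式手工估算的示意脚本（非仓库自带工具；其中 70B 级 GQA 模型的层数、KV 头数等配置均为假设示例，实际数值请以模型的 config.json 为准）：\n```python\n# KV-Cache 字节数 = 2(K 和 V) × 层数 × KV 头数 × head_dim × 序列长度 × 每元素字节数\ndef kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int, seq_len: int, dtype_bytes: int = 2) -> int:\n    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes\n\n# 假设示例：80 层、8 个 KV 头、head_dim=128 的 GQA 模型，fp16，单条 4k token 对话\ngb = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096) \u002F 1e9\nprint(f\"单条请求的 KV-Cache 约 {gb:.2f} GB\")  # 再乘以并发请求数即可估算总显存压力\n```\n估算结果随是否采用 GQA\u002FMQA、数值精度（fp16\u002Ffp8\u002Fint2）与上下文长度线性变化，这也正是上文各类压缩方法的切入点。\n\n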
### 2. 批量测试压缩效果\n```bash\n# kvpress 官方评测脚本位于其仓库的 evaluation\u002F 目录（需 GPU）\n# 支持的模型、press 与数据集等参数请以该目录下的说明为准\ngit clone https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress.git\ncd kvpress\u002Fevaluation\npython evaluate.py --help\n```\n\n---\n\n> 更多方法（PyramidKV、H2O、StreamingLLM 等）的代码与示例，直接点击仓库对应链接即可跳转。","一家 8 人 AI 创业团队正在把 70B 参数的 Llama-3 部署到 4 张 A100 上，做面向 10 万日活用户的智能客服 SaaS。  \n### 没有 Awesome-KV-Cache-Compression 时\n- 显存吃紧：单条 4k token 对话 KV Cache 占 3.2 GB，4 卡只能并发 12 条请求，高峰期排队 30 秒以上。  \n- 盲目试错：团队零散地搜论文、跑实验，两周才试完 H2O 一种方法，结果压缩率 30 % 仍不达标。  \n- 工程踩坑：把 SnapKV 集成到 transformers 时接口对不上，调试 3 天才发现版本冲突。  \n- 成本飙升：为了撑住并发，临时又租 4 张 A100，月账单多出 1.2 万美元。  \n\n### 使用 Awesome-KV-Cache-Compression 后\n- 显存释放：照着仓库里的 kvpress 脚本，10 分钟跑通 PyramidInfer，KV Cache 降到 0.8 GB，并发提升到 48 条，排队时间 \u003C 2 秒。  \n- 快速选型：阅读仓库整理的对比表，30 分钟锁定 Scissorhands + SnapKV 组合，压缩率 75 %，BLEU 只掉 1.3。  \n- 零踩坑：直接复制仓库提供的 transformers patch，一行命令安装依赖，当天上线灰度。  \n- 成本腰斩：无需加卡，4 张 A100 轻松扛住峰值，月租费回到 6 千美元。  \n\n核心价值：Awesome-KV-Cache-Compression 把压缩方案的调研、验证和落地时间从“周”压缩到“小时”，让中小团队也能低成本跑大模型。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOctober2001_Awesome-KV-Cache-Compression_f8a2cb2a.png","October2001","Longze Chen","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FOctober2001_ab786442.jpg","Phd student @ SIAT-NLP","University of Chinese Academy of Sciences","Beijing, China",null,"https:\u002F\u002Fgithub.com\u002FOctober2001",679,23,"2026-04-01T11:36:20","MIT",1,"Linux","需要 NVIDIA GPU，显存 8GB+，CUDA 11.7+","未说明",{"notes":92,"python":90,"dependencies":93},"该仓库为 Awesome 列表，本身不直接运行代码；所列项目（如 kvpress、KVCache-Factory 等）需各自查看对应仓库的运行环境说明。",[94,95],"transformers","torch",[15],[98,99,100],"awesome-list","large-language-models","papers","2026-03-27T02:49:30.150509","2026-04-06T08:17:46.579753",[104,109,114,119,123,127],{"id":105,"question_zh":106,"answer_zh":107,"source_url":108},6157,"如何向 Awesome-KV-Cache-Compression 列表提交新的仓库或论文？","在 GitHub 上新建 Issue，标题可写为“request: add xxx”或“a relevant paper and code repo!”。正文需包含：\n1. 仓库或论文名称\n2. 仓库链接（GitHub URL）\n3. 论文链接（arXiv 或 PDF）\n4. 建议的分类（如 Pruning \u002F Evicting \u002F Sparse）\n维护者通常会在 1-2 天内回复并合并。","https:\u002F\u002Fgithub.com\u002FOctober2001\u002FAwesome-KV-Cache-Compression\u002Fissues\u002F3",{"id":110,"question_zh":111,"answer_zh":112,"source_url":113},6158,"提交后多久能被合并进列表？","根据已有案例，维护者 October2001 会在 1 天内回复“Sure, I've finished it”或“Thank you! I have added it...”，并关闭 Issue，内容即出现在 README 中。","https:\u002F\u002Fgithub.com\u002FOctober2001\u002FAwesome-KV-Cache-Compression\u002Fissues\u002F9",{"id":115,"question_zh":116,"answer_zh":117,"source_url":118},6159,"是否必须提供代码仓库才能被收录？","不是必须，但强烈建议同时给出论文和代码仓库链接。已收录的案例（如 LaCache、Context-Memory）均同时提供了 arXiv 论文和 GitHub 代码。","https:\u002F\u002Fgithub.com\u002FOctober2001\u002FAwesome-KV-Cache-Compression\u002Fissues\u002F1",{"id":120,"question_zh":121,"answer_zh":122,"source_url":113},6160,"列表对论文的会议\u002F期刊有要求吗？","没有硬性要求，但已收录的论文多来自 ICLR、ICML 等顶会。只要与 KV-Cache 压缩相关即可提交。",{"id":124,"question_zh":125,"answer_zh":126,"source_url":108},6161,"可以一次提交多个项目吗？","建议每个 Issue 只提交一个项目，便于维护者跟踪和快速合并。多个项目可分多条 Issue 提交。",{"id":128,"question_zh":129,"answer_zh":130,"source_url":118},6162,"如果项目已更新，如何同步最新信息？","重新开一条 Issue，标题注明“update: xxx”，在正文中说明需要更新的内容（如新版本、新链接），维护者会按同样流程处理。",[]]