[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-NVIDIA--kvpress":3,"tool-NVIDIA--kvpress":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",142651,2,"2026-04-06T23:34:12",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107888,"2026-04-06T11:32:50",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 
格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":76,"owner_website":77,"owner_url":78,"languages":79,"stars":92,"forks":93,"last_commit_at":94,"license":95,"difficulty_score":32,"env_os":96,"env_gpu":97,"env_ram":98,"env_deps":99,"category_tags":106,"github_topics":107,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":116,"updated_at":117,"faqs":118,"releases":148},4951,"NVIDIA\u002Fkvpress","kvpress","LLM KV cache compression made easy","kvpress 是一款专为优化大语言模型长文本处理而设计的开源工具，旨在让 KV 缓存压缩变得简单高效。在处理百万级 token 的超长上下文时，传统 Transformer 模型的键值（KV）缓存会线性增长，导致显存需求激增（例如 Llama 3.1-70B 处理 1M token 需高达 330GB 内存），这使得部署成本极高且难以落地。kvpress 通过集成多种先进的压缩算法，在模型预填充阶段甚至生成过程中动态压缩缓存，显著降低显存占用，同时尽量保持模型回答的准确性。\n\n该工具特别适合 AI 研究人员和开发者使用。它基于 🤗 transformers 生态构建，提供了便捷的管道接口，用户只需几行代码即可调用如“期望注意力压缩”等策略，无需深入底层实现就能快速验证新算法或部署长上下文应用。此外，kvpress 还创新性地支持实验性的“解码期压缩”功能，允许在文本生成过程中周期性清理缓存，进一步突破长度限制。无论是希望降低推理成本的工程师，还是致力于探索更高效压缩机制的研究者，kvpress 都提供了一个灵活、易用的基准平台，助力长上下文大模型的普及与优化。","[![PyPI version](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fkvpress.svg)](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fkvpress)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-green.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0)\n[![Colab example notebook](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1JNvaTKuuAHrl49dYB9-mdEH_y52Ib-NP?usp=drive_link)\n[![Hugging Face Space](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗%20Hugging%20Face-Space-blue)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fnvidia\u002Fkvpress)\n[![Blog post](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗%20Hugging%20Face-Blog-blue)](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fnvidia\u002Fkvpress)\n[![Hugging Face Leaderboard](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗%20HuggingFace-Leaderboard-orange)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fnvidia\u002Fkvpress-leaderboard)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2510.00636-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.00636v1)\n\n\n![kvpress](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVIDIA_kvpress_readme_dff5230412fb.jpg)\n\n\nDeploying 
long-context LLMs is costly due to the linear growth of the key-value (KV) cache in transformer models. For example, handling 1M tokens with Llama 3.1-70B in float16 requires up to 330GB of memory. kvpress implements multiple KV cache compression methods and benchmarks using 🤗 transformers, aiming to simplify the development of new methods for researchers and developers in this field.\n\n## Installation\n\n```bash\npip install kvpress\n```\n\nFor a local installation, use [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F):\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress.git\ncd kvpress\nuv sync\n```\n\nTo install with all optional dependencies, run:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress.git\ncd kvpress\nuv sync --extra eval --extra flash-attn\n```\n\n## Usage\n\nKVPress provides a set of \"presses\" that compress the KV cache during the prefilling phase. Each press is associated with a `compression_ratio` attribute that measures the compression of the cache. The easiest way to use a press is through our custom `KVPressTextGenerationPipeline`. It is automatically registered as a transformers pipeline with the name \"kv-press-text-generation\" when kvpress is imported and handles chat templates and tokenization for you:\n\n```python\nfrom transformers import pipeline\nfrom kvpress import ExpectedAttentionPress\n\nmodel = \"Qwen\u002FQwen3-8B\"\npipe = pipeline(\"kv-press-text-generation\", model=model, device_map=\"auto\", dtype=\"auto\")\n\ncontext = \"A very long text you want to compress once and for all\"\nquestion = \"\\nA question about the compressed context\"  # optional\n\npress = ExpectedAttentionPress(compression_ratio=0.5)\nanswer = pipe(context, question=question, press=press)[\"answer\"]\n```\n\nIn the snippet above, the compression is applied only to the context tokens, so that you can evaluate the compression for different questions. Check the [Wikipedia notebook demo](notebooks\u002Fwikipedia_demo.ipynb) for a more detailed example (also available on Colab [here](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1JNvaTKuuAHrl49dYB9-mdEH_y52Ib-NP)).\n\n\u003Cdetails>\u003Csummary>\nDecoding Compression\n\u003C\u002Fsummary>\nBy default, KVPress applies compression during the prefilling phase. As a new (experimental) feature, we now support decoding compression via the `DecodingPress` wrapper. `DecodingPress` compresses the KV cache periodically during token generation, optionally maintaining a buffer of recent hidden states. `DecodingPress` supports the following parameters:\n\n- `base_press`: Any ScorerPress (e.g., `KnormPress`, `CriticalKVPress`)\n- `compression_interval`: Steps between compressions (default: 10)\n- `target_size`: Target size of the cache after compression (default: 1024)\n- `hidden_states_buffer_size`: Number of hidden states to buffer before compression (default: 128). Some presses don't need buffered hidden states and can set this to 0.\n\nUnlike presses parameterized by a compression ratio, `DecodingPress` uses a `target_size` to compress the cache. 
This means that the cache is compressed every `compression_interval` steps, and the compression ratio is automatically computed such that the size of the cache after compression equals `target_size`.\n\nAn example of decoding compression:\n\n```python\nfrom transformers import pipeline\nfrom kvpress import KnormPress\nfrom kvpress import DecodingPress\n\n# Initialize the pipeline\ndevice = \"cuda:0\"\nmodel = \"meta-llama\u002FLlama-3.1-8B-Instruct\"\nmodel_kwargs = {\"attn_implementation\": \"flash_attention_2\"}\npipe = pipeline(\"kv-press-text-generation\", model=model, device=device, model_kwargs=model_kwargs)\n\n# Create a decoding press that compresses every 10 steps to 512 tokens\ndecoding_press = DecodingPress(\n    base_press=KnormPress(),\n    compression_interval=10,\n    target_size=512\n)\n\n# Use with pipeline\ncontext = \"A very long text you want to compress during generation\"\nquestion = \"Tell me a long story about this context\"\nresponse = pipe(context, question=question, press=decoding_press)[\"answer\"]\n```\n\n> Not all existing presses are fully compatible with DecodingPress due to fundamental differences in how compression works during decoding versus prefilling. In particular, we only support ScorerPresses as base presses.\n\n\u003C\u002Fdetails>\n\n## Available presses\n\nAll current presses are training-free and inherit from `BasePress` ([source](kvpress\u002Fpresses\u002Fbase_press.py)). \n\nSeveral presses inherit from `ScorerPress` ([source](kvpress\u002Fpresses\u002Fscorer_press.py)) and rely on a score to prune the KV pairs with the lowest importance:\n\n- `RandomPress` ([source](kvpress\u002Fpresses\u002Frandom_press.py)): random score\n- `KnormPress` ([source](kvpress\u002Fpresses\u002Fknorm_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.11430)): inverse norm of the key\n- `SnapKVPress` ([source](kvpress\u002Fpresses\u002Fsnapkv_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.14469)): average attention weight of the last queries\n- `ExpectedAttentionPress` ([source](kvpress\u002Fpresses\u002Fexpected_attention_press.py), [notebook](notebooks\u002Fexpected_attention.ipynb)): expected attention weight during the generation phase \n- `StreamingLLMPress` ([source](kvpress\u002Fpresses\u002Fstreaming_llm_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17453)): keep only the initial and recent tokens \n- `TOVAPress` ([source](kvpress\u002Fpresses\u002Ftova_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.06104)): attention weight of the last query averaged across heads \n- `ObservedAttentionPress` ([source](kvpress\u002Fpresses\u002Fobserved_attention_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.14048)): average attention weight observed during the prefilling phase\n- `QFilterPress` ([source](kvpress\u002Fpresses\u002Fqfilter_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.02812)): project the key representations on the main SVD component of the query vectors to approximate the attention scores.\n- `PyramidKVPress` ([source](kvpress\u002Fpresses\u002Fpyramidkv_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.02069)): maintain pyramid-like cache sizes, allocating more cache budget to lower layers and less to higher layers\n- `LagKVPress` ([source](kvpress\u002Fpresses\u002Flagkv_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.04704)): leverages KV lag-relative information to compress. 
It is query-free, attention-weight-free, and flash-attention compatible.\n- `KeyDiffPress` ([source](kvpress\u002Fpresses\u002Fkeydiff_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.15364)): evict tokens based solely on key similarity.\n- `NonCausalAttnPress` ([source](kvpress\u002Fpresses\u002Fnon_causal_attention_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.08143)): evict tokens based on non-causal chunked attention scores.\n- `LeverageScorePress` ([source](kvpress\u002Fpresses\u002Fleverage_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.08143)): evict tokens based on approximate statistical leverage (i.e. we preserve outliers in the key space).\n- `CompactorPress` ([source](kvpress\u002Fpresses\u002Fcompactor_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.08143)): blend `NonCausalAttnPress` and `LeverageScorePress` based on the `compression_ratio`.\n- `CURPress` ([source](kvpress\u002Fpresses\u002Fcur_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.15038)): prune keys and values based on the CUR decomposition using approximate leverage scores.\n- `KVzapPress` ([source](kvpress\u002Fpresses\u002Fkvzap\u002Fkvzap_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.07891), [training](kvzap)): approximate KVzip+ using a fast surrogate model. To be used in conjunction with `DMSPress`.\n- `FastKVzipPress` ([source](kvpress\u002Fpresses\u002Ffastkvzip_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.17668)): approximate KVzip through a lightweight gating mechanism.\n\nSome presses rely on different logic:\n- `ThinKPress` ([source](kvpress\u002Fpresses\u002Fthink_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.21018)): compress the dimensions of the keys based on the channel attention score on the last queries \n- `SimLayerKVPress` ([source](kvpress\u002Fpresses\u002Fsimlayerkv_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13846)): identify \"lazy\" layers, and apply the StreamingLLM approach to them \n- `DuoAttentionPress` ([source](kvpress\u002Fpresses\u002Fduo_attention_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10819)): split heads into retrieval heads (no compression) and streaming heads (StreamingLLM approach)\n- `FinchPress` ([source](kvpress\u002Fpresses\u002Ffinch_press.py), [paper](https:\u002F\u002Fdirect.mit.edu\u002Ftacl\u002Farticle\u002Fdoi\u002F10.1162\u002Ftacl_a_00716\u002F125280)): similar to SnapKV with a dynamic window size and key-value re-rotation\n- `KVzipPress` ([source](kvpress\u002Fpresses\u002Fkvzip_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23416)): identify redundant KV pairs through context reconstruction. 
Achieve near-lossless compression at the cost of multiple forward passes.\n\nFinally, we provide wrapper presses that can be combined with other presses:\n- `AdaKVPress` ([source](kvpress\u002Fpresses\u002Fadakv_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.11550)): prune the bottom scores of any `ScorerPress` across all heads, achieving head-wise compression \n- `PerLayerCompressionPress` ([source](kvpress\u002Fpresses\u002Fper_layer_compression_press.py)): compress each layer with a different compression ratio (experimental)\n- `ComposedPress` ([source](kvpress\u002Fpresses\u002Fcomposed_press.py)): compose multiple presses together by chaining their forward hooks (see the sketch after this list)\n- `KeyRerotationPress` ([source](kvpress\u002Fpresses\u002Fkey_rerotation_press.py)): rerotate pruned keys to have continuous RoPE embeddings\n- `ChunkKVPress` ([source](kvpress\u002Fpresses\u002Fchunkkv_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.00299)): compress by selecting important chunks, preserving semantic coherence\n- `ChunkPress` ([source](kvpress\u002Fpresses\u002Fchunk_press.py), [paper](https:\u002F\u002Fdirect.mit.edu\u002Ftacl\u002Farticle\u002Fdoi\u002F10.1162\u002Ftacl_a_00716\u002F125280)): compress the KV cache on each sequence chunk separately. This can yield more uniform compression across long sequences\n- `CriticalKVPress` and `CriticalAdaKVPress` ([source](kvpress\u002Fpresses\u002Fcriticalkv_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03805)): refine the scores using the L1 norm of Wo @ values, coupled with a two-stage selection.\n- `BlockPress` ([source](kvpress\u002Fpresses\u002Fblock_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.15364)): segment the input sequence into non-overlapping blocks and compress iteratively (⚠️ not a true chunked-prefill implementation)\n- `DecodingPress` ([source](kvpress\u002Fpresses\u002Fdecoding_press.py)): allows compression during decoding; see the decoding section in this README.\n- `PrefillDecodingPress` ([source](kvpress\u002Fpresses\u002Fprefill_decoding_press.py)): allows compression during both prefilling and decoding.\n- `DMSPress` ([source](kvpress\u002Fpresses\u002Fdms_press.py), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05345)): evict keys and values whose `ScorerPress` scores fall below a given threshold instead of relying on top-k scores. Supports both prefilling and decoding (if `decoding=True`), but only dense-prefill, not sparse-prefill.\n
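\nAs an illustration, here is a minimal sketch that combines these wrappers with the pipeline from the Usage section. It assumes that wrapper presses take the wrapped press as their first argument (as in the `CriticalKVPress(KnormPress())` example later in this README) and that `ComposedPress` takes the list of presses to chain:\n\n```python\nfrom transformers import pipeline\nfrom kvpress import AdaKVPress, ComposedPress, ExpectedAttentionPress, KnormPress\n\npipe = pipeline(\"kv-press-text-generation\", model=\"Qwen\u002FQwen3-8B\", device_map=\"auto\", dtype=\"auto\")\n\n# Head-wise variant of a scorer press: AdaKVPress prunes the bottom scores\n# across all heads jointly, so each head can keep a different number of KV pairs\nhead_wise = AdaKVPress(ExpectedAttentionPress(compression_ratio=0.5))\n\n# Chain the forward hooks of several presses into a single press\n# (assumed signature: ComposedPress(presses=[...]))\ncomposed = ComposedPress(presses=[KnormPress(compression_ratio=0.2), head_wise])\n\ncontext = \"A very long text you want to compress once and for all\"\nanswer = pipe(context, question=\"A question about the context\", press=composed)[\"answer\"]\n```\n\nFor a detailed list of existing KV cache compression methods, check [Awesome-KV-Cache-Compression](https:\u002F\u002Fgithub.com\u002FOctober2001\u002FAwesome-KV-Cache-Compression) or [Awesome-LLM-Compression](https:\u002F\u002Fgithub.com\u002FHuangOwen\u002FAwesome-LLM-Compression?tab=readme-ov-file#kv-cache-compression)\n\n\n## Evaluation\nWe provide a simple CLI to evaluate the performance of different presses on several long-context datasets. \n\n- Accuracy: Test your method on popular benchmarks directly using our CLI. \n- Speed and Memory: The [speed_and_memory](notebooks\u002Fspeed_and_memory.ipynb) notebook can help you measure peak memory usage and total time gain.\n\nPlease refer to the [evaluation](evaluation\u002FREADME.md) directory in this repo for more details and results. 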
\n\nBelow we report the average performance on the RULER dataset with 4k context length for different presses, from our [![Hugging Face Leaderboard](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗%20HuggingFace-Leaderboard-orange)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fnvidia\u002Fkvpress-leaderboard).\n\n## Quantization\n\nWe support KV cache quantization through the transformers `QuantizedCache` class (see [HF blog post](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fkv-cache-quantization#how-to-use-quantized-kv-cache-in-%F0%9F%A4%97-transformers)). To use it, simply pass a cache object to your pipeline:\n\n```python\nfrom transformers import QuantizedCache\n\ncache = QuantizedCache(backend=\"quanto\", nbits=4)\n\npipe(..., cache=cache)\n```\n\nBy default, the `DynamicCache` is used (no quantization). \n\n> [!IMPORTANT]  \n> To use the `QuantizedCache`, you need to install additional dependencies (_e.g._ `pip install optimum-quanto`).\n
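\nA quantized cache can also be combined with a press in the same pipeline call. Below is a minimal sketch based on the snippets above; passing the `press` and `cache` arguments together is assumed to compose as documented:\n\n```python\nfrom transformers import QuantizedCache, pipeline\nfrom kvpress import ExpectedAttentionPress\n\npipe = pipeline(\"kv-press-text-generation\", model=\"Qwen\u002FQwen3-8B\", device_map=\"auto\", dtype=\"auto\")\n\ncontext = \"A very long text you want to compress once and for all\"\nquestion = \"A question about the compressed context\"\n\n# Prune 50% of the KV pairs during prefilling, then store the rest quantized to 4 bits\npress = ExpectedAttentionPress(compression_ratio=0.5)\ncache = QuantizedCache(backend=\"quanto\", nbits=4)\n\nanswer = pipe(context, question=question, press=press, cache=cache)[\"answer\"]\n```\n\n## Contributing\n\nWe welcome contributions! To add a new press, simply open an issue or submit a pull request. Check the [new_press.ipynb](notebooks\u002Fnew_press.ipynb) notebook for a step-by-step guide.\n\n## Citation\n\nIf you use KVPress in your research, please cite our paper:\n\n```bibtex\n@article{devoto2025expectedattention,\n  title={Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution},\n  author={Devoto, Alessio and Jeblick, Maximilian and J{\\'e}gou, Simon},\n  journal={arXiv preprint arXiv:2510.00636},\n  year={2025},\n  url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.00636}\n}\n```\n\n## FAQ\n\n\u003Cdetails>\u003Csummary> \n\n### Which models are supported?\n\u003C\u002Fsummary>\n\nSome presses depend on the model architecture (_e.g._ `ExpectedAttentionPress` or `SnapKVPress`) hence they might not work with all models. We tested support for `LlamaForCausalLM`, `MistralForCausalLM`, `Phi3ForCausalLM`, `Qwen2ForCausalLM`, `Qwen3ForCausalLM`, and `Gemma3ForCausalLM`, but many other models might be supported out of the box because their implementations in transformers are often similar.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary> \n\n### How to run inference on multiple GPUs?\n\u003C\u002Fsummary>\n\nkvpress supports multi-GPU inference through [accelerate](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Faccelerate\u002Fen\u002Findex):\n\n```python\npipe = pipeline(\"kv-press-text-generation\", model=model, device_map=\"auto\")\n```\n\n\u003C\u002Fdetails>\n\n\n\u003Cdetails> \u003Csummary> \n\n### What are the memory and throughput gains?\n\u003C\u002Fsummary>\n\nMemory usage should be reduced by around `compression_ratio * kv_cache_size`. For example, with `compression_ratio=0.5`, the 330GB KV cache of Llama 3.1-70B at 1M tokens would shrink to roughly 165GB. As the KV cache is smaller, decoding should also be faster. You can measure peak memory usage gain and total time gain using [this notebook](notebooks\u002Fspeed_and_memory.ipynb).\n\u003C\u002Fdetails>\n\n\n\u003Cdetails> \u003Csummary> \n\n### How does a press work? \u003C\u002Fsummary>\n\nA press registers a forward hook (`press.forward_hook` method) to each attention layer during the prefilling phase. 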
Registration can be applied using the press as a context manager (`press.__call__` method):\n\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM\nfrom kvpress import KnormPress\n\ndevice = \"cuda:0\"\nckpt = \"meta-llama\u002FMeta-Llama-3.1-8B-Instruct\"\nmodel = AutoModelForCausalLM.from_pretrained(ckpt).to(device)\npress = KnormPress(compression_ratio=0.4)\n\ninputs = model.dummy_inputs[\"input_ids\"].to(device)\n\nwith torch.no_grad():\n    print(model(inputs).past_key_values[0][0].shape)\n    # torch.Size([3, 8, 5, 128])\n    \nwith torch.no_grad(), press(model):\n    print(model(inputs).past_key_values[0][0].shape)\n    # torch.Size([3, 8, 3, 128])\n```\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary> \n\n### Why not use model.generate?\n\u003C\u002Fsummary>\n\nIn fact, you can use `model.generate` with a press by using the press as a context manager:\n\n```python\nwith press(model):\n    outputs = model.generate(inputs)\n```\n\nHowever, the `generate` method does not allow excluding the question from the compression, which would artificially favor methods such as SnapKV. Ideally, we want a compression method that works whatever comes after the context (_e.g._ for use cases such as chat or document question answering). Finally, the `generate` method does not support generating answers for multiple questions at once.\n\n\u003C\u002Fdetails>\n\n\n\n\u003Cdetails>\u003Csummary> \n\n### Can I combine compression during prefilling and decoding?\n\u003C\u002Fsummary>\n\n\nYes, using `PrefillDecodingPress`, which combines separate presses for the prefilling and decoding phases.\n\n**Parameters:**\n- `prefilling_press`: Press used during prefill phase\n- `decoding_press`: Press used during decoding phase\n\n## Usage Examples\n\n### Basic Decoding Compression\n\n```python\nfrom transformers import pipeline\nfrom kvpress import KnormPress\nfrom kvpress import DecodingPress\n\n# Initialize the pipeline\ndevice = \"cuda:0\"\nmodel = \"meta-llama\u002FLlama-3.1-8B-Instruct\"\nmodel_kwargs = {\"attn_implementation\": \"flash_attention_2\"}\npipe = pipeline(\"kv-press-text-generation\", model=model, device=device, model_kwargs=model_kwargs)\n\n# Create a decoding press that compresses every 10 steps to 512 tokens\ndecoding_press = DecodingPress(\n    base_press=KnormPress(),\n    compression_interval=10,\n    target_size=512\n)\n\n# Use with pipeline\ncontext = \"A very long text you want to compress during generation\"\nquestion = \"Tell me a long story about this context\"\nresponse = pipe(context, question=question, press=decoding_press)[\"answer\"]\n```\n\n### Combined Prefill + Decoding Compression\n\n```python\nfrom transformers import pipeline\nfrom kvpress import CriticalKVPress, KnormPress\nfrom kvpress import DecodingPress, PrefillDecodingPress\n\n# Initialize the pipeline\ndevice = \"cuda:0\"\nmodel = \"meta-llama\u002FLlama-3.1-8B-Instruct\"\nmodel_kwargs = {\"attn_implementation\": \"flash_attention_2\"}\npipe = pipeline(\"kv-press-text-generation\", model=model, device=device, model_kwargs=model_kwargs)\n\n# Different strategies for prefill vs decoding\nprefill_press = CriticalKVPress(KnormPress())\ndecoding_press = DecodingPress(\n    base_press=KnormPress(compression_ratio=0.2),\n    compression_interval=5,\n    target_size=256\n)\n\n# Combine them\ncombined_press = PrefillDecodingPress(\n    prefilling_press=prefill_press,\n    decoding_press=decoding_press\n)\n\ncontext = \"A very long context that will be compressed during prefill\"\nquestion = \"Generate a detailed analysis that will 
be compressed during decoding\"\nresponse = pipe(context, question=question, press=combined_press)[\"answer\"]\n```\n","[![PyPI版本](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fkvpress.svg)](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fkvpress)\n[![许可证](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-green.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0)\n[![Colab示例笔记本](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1JNvaTKuuAHrl49dYB9-mdEH_y52Ib-NP?usp=drive_link)\n[![Hugging Face Space](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗%20Hugging%20Face-Space-blue)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fnvidia\u002Fkvpress)\n[![博客文章](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗%20Hugging%20Face-Blog-blue)](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fnvidia\u002Fkvpress)\n[![Hugging Face排行榜](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗%20HuggingFace-Leaderboard-orange)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fnvidia\u002Fkvpress-leaderboard)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2510.00636-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.00636v1)\n\n\n![kvpress](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVIDIA_kvpress_readme_dff5230412fb.jpg)\n\n\n由于Transformer模型中的键值（KV）缓存呈线性增长，部署长上下文LLM的成本非常高。例如，使用Llama 3.1-70B以float16格式处理100万 tokens 需要高达330GB的内存。kvpress实现了多种KV缓存压缩方法，并基于🤗 Transformers进行了基准测试，旨在为该领域的研究人员和开发者简化新方法的开发流程。\n\n## 安装\n\n```bash\npip install kvpress\n```\n\n若需本地安装，请使用 [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F)：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress.git\ncd kvpress\nuv sync\n```\n\n如需安装所有可选依赖项，请执行以下命令：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress.git\ncd kvpress\nuv sync --extra eval --extra flash-attn\n```\n\n## 使用\n\nKVPress提供了一系列在预填充阶段压缩KV缓存的“压缩器”。每个压缩器都关联有一个`compression_ratio`属性，用于衡量缓存的压缩程度。使用压缩器最简单的方式是通过我们自定义的`KVPressTextGenerationPipeline`。当导入kvpress时，它会自动注册为名为“kv-press-text-generation”的Transformers流水线，并为您处理聊天模板和分词：\n\n```python\nfrom transformers import pipeline\nfrom kvpress import ExpectedAttentionPress\n\nmodel = \"Qwen\u002FQwen3-8B\"\npipe = pipeline(\"kv-press-text-generation\", model=model, device_map=\"auto\", dtype=\"auto\")\n\ncontext = \"一段您希望一次性压缩的超长文本\"\nquestion = \"\\n关于压缩后上下文的一个问题\"  # 可选\n\npress = ExpectedAttentionPress(compression_ratio=0.5)\nanswer = pipe(context, question=question, press=press)[\"answer\"]\n```\n\n在上述代码片段中，压缩仅应用于上下文tokens，以便您可以针对不同问题评估压缩效果。更多详细示例请参阅[Wikipedia笔记本演示](notebooks\u002Fwikipedia_demo.ipynb)，该演示也在Colab上提供[此处](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1JNvaTKuuAHrl49dYB9-mdEH_y52Ib-NP)。\n\n\u003Cdetails>\u003Csummary>\n解码阶段的压缩\n\u003C\u002Fsummary>\n默认情况下，KVPress会在预填充阶段应用压缩。作为一项新的（实验性）功能，我们现在支持通过`DecodingPress`包装器实现解码阶段的压缩。`DecodingPress`会在生成token的过程中定期压缩KV缓存，同时可以选择性地保留最近隐藏状态的缓冲区。`DecodingPress`支持以下参数：\n\n- `base_press`: 任何ScorerPress（例如`KnormPress`、`CriticalKVPress`）\n- `compression_interval`: 压缩之间的步数（默认：10）\n- `target_size`: 压缩后缓存的目标大小（默认：1024）\n- `hidden_states_buffer_size`: 压缩前需要缓冲的隐藏状态数量（默认：128）。某些压缩器不需要缓冲隐藏状态，可以将此值设为0。\n\n与基于压缩比的压缩器不同，`DecodingPress`使用`target_size`来控制缓存的压缩程度。这意味着缓存会每隔`compression_interval`步进行一次压缩，而压缩比则会自动计算，以确保压缩后的缓存大小等于`target_size`。\n\n解码压缩的示例：\n\n```python\nfrom transformers import pipeline\nfrom kvpress import KnormPress\nfrom kvpress import DecodingPress\n\n# 初始化流水线\ndevice = \"cuda:0\"\nmodel = 
\"meta-llama\u002FLlama-3.1-8B-Instruct\"\nmodel_kwargs = {\"attn_implementation\": \"flash_attention_2\"}\npipe = pipeline(\"kv-press-text-generation\", model=model，device=device，model_kwargs=model_kwargs)\n\n# 创建一个每10步压缩一次，目标缓存大小为512 tokens 的解码压缩器\ndecoding_press = DecodingPress(\n    base_press=KnormPress(),\n    compression_steps=10，\n    token_buffer_size=512\n)\n\n# 与流水线结合使用\ncontext = \"一段您希望在生成过程中压缩的超长文本\"\nquestion = \"请根据这段上下文讲一个长故事\"\nresponse = pipe(context，question=question，press=decoding_press)[\"answer\"]\n```\n\n> 并非所有现有压缩器都完全兼容`DecodingPress`，因为解码阶段和预填充阶段的压缩机制存在根本差异。特别是，我们目前仅支持将ScorerPress用作基础压缩器。\n\u003C\u002Fdetails>\n\n## 可用的压缩器\n\n当前所有的压缩器都不需要训练，并且都继承自 `BasePress`（[源代码](kvpress\u002Fpresses\u002Fbase_press.py)）。\n\n其中，若干压缩器继承自 `ScorerPress`（[源代码](kvpress\u002Fpresses\u002Fscorer_press.py)），它们依赖于一个分数来裁剪重要性最低的 KV 对：\n\n- `RandomPress`（[源代码](kvpress\u002Fpresses\u002Frandom_press.py)）：随机分数\n- `KnormPress`（[源代码](kvpress\u002Fpresses\u002Fknorm_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.11430)）：键的逆范数\n- `SnapKVPress`（[源代码](kvpress\u002Fpresses\u002Fsnapkv_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.14469)）：最后查询的平均注意力权重\n- `ExpectedAttentionPress`（[源代码](kvpress\u002Fpresses\u002Fexpected_attention_press.py)，[笔记本](notebooks\u002Fexpected_attention.ipynb)）：生成阶段的期望注意力权重\n- `StreamingLLMPress`（[源代码](kvpress\u002Fpresses\u002Fstreaming_llm_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17453)）：仅保留初始和最近的标记\n- `TOVAPress`（[源代码](kvpress\u002Fpresses\u002Ftova_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.06104)）：对所有头的最后查询注意力权重取平均\n- `ObservedAttentionPress`（[源代码](kvpress\u002Fpresses\u002Fobserved_attention_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.14048)）：在预填充阶段观察到的平均注意力权重\n- `QFilterPress`（[源代码](kvpress\u002Fpresses\u002Fqfilter_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.02812)）：将键表示投影到查询向量的主要奇异值分解成分上，以近似计算注意力分数。\n- `PyramidKVPress`（[源代码](kvpress\u002Fpresses\u002Fpyramidkv_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.02069)）：维持金字塔形的缓存大小，为较低层分配更多缓存预算，而为较高层分配较少。\n- `LagKVPress`（[源代码](kvpress\u002Fpresses\u002Flagkv_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.04704)）：利用 KV 延迟相关信息进行压缩。它无需查询、无需注意力权重，且兼容 Flash Attention。\n- `KeyDiffPress`（[源代码](kvpress\u002Fpresses\u002Fkeydiff_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.15364)）：仅根据键的相似性驱逐标记。\n- `NonCausalAttnPress`（[源代码](kvpress\u002Fpresses\u002Fnon_causal_attention_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.08143)）：基于非因果分块注意力分数驱逐标记。\n- `LeverageScorePress`（[源代码](kvpress\u002Fpresses\u002Fleverage_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.08143)）：根据近似的统计杠杆度驱逐标记（即保留键空间中的异常值）。\n- `CompactorPress`（[源代码](kvpress\u002Fpresses\u002Fcompactor_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.08143)）：根据压缩比混合使用 `NonCausalAttnPress` 和 `LeverageScorePress`。\n- `CURPress`（[源代码](kvpress\u002Fpresses\u002Fcur_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.15038)）：基于 CUR 分解和近似杠杆度分数修剪键和值。\n- `KVzapPress`（[源代码](kvpress\u002Fpresses\u002Fkvzap\u002Fkvzap_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.07891)，[训练](kvzap)）：通过快速代理模型近似 KVzip+。需与 `DMSPress` 配合使用。\n- `FastKVzipPress`（[源代码](kvpress\u002Fpresses\u002Ffastkvzip_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.17668)）：通过轻量级门控机制近似 KVzip。\n\n还有一些压缩器采用不同的逻辑：\n- 
`ThinKPress`（[源代码](kvpress\u002Fpresses\u002Fthink_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.21018)）：根据最后查询上的通道注意力分数压缩键的维度。\n- `SimLayerKVPress`（[源代码](kvpress\u002Fpresses\u002Fsimlayerkv_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13846)）：识别“懒惰”层，并对其应用 StreamingLLM 方法。\n- `DuoAttentionPress`（[源代码](kvpress\u002Fpresses\u002Fduo_attention_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10819)）：将注意力头分为检索头（不压缩）和流式头（采用 StreamingLLM 方法）。\n- `FinchPress`（[源代码](kvpress\u002Fpresses\u002Ffinch_press.py)，[论文](https:\u002F\u002Fdirect.mit.edu\u002Ftacl\u002Farticle\u002Fdoi\u002F10.1162\u002Ftacl_a_00716\u002F125280)）：类似于 SnapKV，但具有动态窗口大小和键值重旋转。\n- `KVzipPress`（[源代码](kvpress\u002Fpresses\u002Fkvzip_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23416)）：通过上下文重建识别冗余的 KV 对。以多次前向传播为代价，实现近乎无损的压缩。\n\n最后，我们提供了一些可以与其他压缩器组合使用的包装型压缩器：\n- `AdaKVPress`（[源代码](kvpress\u002Fpresses\u002Fadakv_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.11550)）：裁剪任何 `ScorerPress` 的底部分数，但跨所有头执行，从而实现按头的压缩。\n- `PerLayerCompressionPress`（[源代码](kvpress\u002Fpresses\u002Fper_layer_compression_press.py)）：为每一层使用不同的压缩比进行压缩（实验性功能）。\n- `ComposedPress`（[源代码](kvpress\u002Fpresses\u002Fcomposed_press.py)）：通过串联多个压缩器的前向钩子将其组合在一起（见列表后的示意）。\n- `KeyRerotationPress`（[源代码](kvpress\u002Fpresses\u002Fkey_rerotation_press.py)）：对已压缩的键进行重新旋转，以保持连续的 RoPE 嵌入。\n- `ChunkKVPress`（[源代码](kvpress\u002Fpresses\u002Fchunkkv_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.00299)）：通过选择重要区块进行压缩，同时保持语义连贯性。\n- `ChunkPress`（[源代码](kvpress\u002Fpresses\u002Fchunk_press.py)，[论文](https:\u002F\u002Fdirect.mit.edu\u002Ftacl\u002Farticle\u002Fdoi\u002F10.1162\u002Ftacl_a_00716\u002F125280)）：分别对每个序列区块的 KV 缓存进行压缩。这可以在长序列中实现更均匀的压缩效果。\n- `CriticalKVPress` 和 `CriticalAdaKVPress`（[源代码](kvpress\u002Fpresses\u002Fcriticalkv_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03805)）：利用 Wo @ 值的 L1 范数细化分数，并结合两阶段选择机制。\n- `BlockPress`（[源代码](kvpress\u002Fpresses\u002Fblock_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.15364)）：将输入序列划分为不重叠的块，并迭代式地进行压缩（⚠️ 并非真正的分块预填充实现）。\n- `DecodingPress`（[源代码](kvpress\u002Fpresses\u002Fdecoding_press.py)）：允许在解码过程中进行压缩，请参阅本 README 中的解码部分。\n- `PrefillDecodingPress`（[源代码](kvpress\u002Fpresses\u002Fprefill_decoding_press.py)）：允许在预填充和解码过程中同时进行压缩。\n- `DMSPress`（[源代码](kvpress\u002Fpresses\u002Fdms_press.py)，[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05345)）：对于任何 `ScorerPress`，驱逐分数低于给定阈值的键和值，而不是依赖于 top-k 分数。支持预填充和解码（若 decoding=True），但仅支持密集预填充，不支持稀疏预填充。\n
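\n作为示例，下面给出一个将这些包装器与“使用”一节中的流水线结合的最小示意。这里假设包装器（如后文 `CriticalKVPress(KnormPress())` 示例）将被包装的压缩器作为第一个位置参数，且 `ComposedPress` 接受待串联的压缩器列表：\n\n```python\nfrom transformers import pipeline\nfrom kvpress import AdaKVPress, ComposedPress, ExpectedAttentionPress, KnormPress\n\npipe = pipeline(\"kv-press-text-generation\", model=\"Qwen\u002FQwen3-8B\", device_map=\"auto\", dtype=\"auto\")\n\n# AdaKVPress：跨所有头联合裁剪任意 ScorerPress 的底部分数，实现按头压缩\nhead_wise = AdaKVPress(ExpectedAttentionPress(compression_ratio=0.5))\n\n# ComposedPress：串联多个压缩器的前向钩子（假设签名为 ComposedPress(presses=[...])）\ncomposed = ComposedPress(presses=[KnormPress(compression_ratio=0.2), head_wise])\n\ncontext = \"一段您希望一次性压缩的超长文本\"\nanswer = pipe(context, question=\"关于该上下文的一个问题\", press=composed)[\"answer\"]\n```\n\n有关现有 KV 缓存压缩方法的详细列表，请参阅 [Awesome-KV-Cache-Compression](https:\u002F\u002Fgithub.com\u002FOctober2001\u002FAwesome-KV-Cache-Compression) 或 [Awesome-LLM-Compression](https:\u002F\u002Fgithub.com\u002FHuangOwen\u002FAwesome-LLM-Compression?tab=readme-ov-file#kv-cache-compression)。\n\n## 评估\n我们提供了一个简单的命令行界面，用于在多个长上下文数据集上评估不同压缩方法的性能。\n\n- 准确性：使用我们的 CLI 直接在流行的基准测试上测试您的方法。\n- 速度与内存：[speed_and_memory](notebooks\u002Fspeed_and_memory.ipynb) 笔记本可以帮助您测量峰值内存使用量和总时间节省。\n\n有关更多详细信息和结果，请参阅此仓库中的 [evaluation](evaluation\u002FREADME.md) 目录。\n\n以下为不同压缩方法在 RULER 数据集（4k 上下文长度）上的平均性能，数据来自我们的 [![Hugging Face Leaderboard](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗%20HuggingFace-Leaderboard-orange)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fnvidia\u002Fkvpress-leaderboard)。\n\n## 量化\n我们通过 transformers 的 `QuantizedCache` 类支持 KV 缓存量化（参见 [HF 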
博客文章](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fkv-cache-quantization#how-to-use-quantized-kv-cache-in-%F0%9F%A4%97-transformers)）。要使用它，只需将缓存对象传递给您的管道：\n\n```python\nfrom transformers import QuantizedCache\n\ncache = QuantizedCache(backend=\"quanto\", nbits=4)\n\npipe(..., cache=cache)\n```\n\n默认情况下，使用的是 `DynamicCache`（不进行量化）。\n\n> [!IMPORTANT]  \n> 要使用 `QuantizedCache`，您需要安装额外的依赖项（例如 `pip install optimum-quanto`）。\n\n## 贡献\n我们欢迎贡献！要添加新的压缩方法，只需打开一个问题或提交拉取请求。请查看 [new_press.ipynb](notebooks\u002Fnew_press.ipynb) 笔记本，获取分步指南。\n\n## 引用\n如果您在研究中使用 KVPress，请引用我们的论文：\n\n```bibtex\n@article{devoto2025expectedattention,\n  title={Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution},\n  author={Devoto, Alessio and Jeblick, Maximilian and J{\\'e}gou, Simon},\n  journal={arXiv preprint arXiv:2510.00636},\n  year={2025},\n  url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.00636}\n}\n```\n\n## 常见问题解答\n\n\u003Cdetails>\u003Csummary> \n\n### 支持哪些模型？ \n\u003C\u002Fsummary>\n\n某些压缩方法依赖于模型架构（例如 `ExpectedAttentionPress` 或 `SnapKVPress`），因此可能无法适用于所有模型。我们已测试了对 `LlamaForCausalLM`、`MistralForCausalLM`、`Phi3ForCausalLM`、`Qwen2ForCausalLM`、`Qwen3ForCausalLM` 和 `Gemma3ForCausalLM` 的支持，但由于这些模型在 transformers 中的实现通常相似，许多其他模型也可能开箱即用地得到支持。\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary> \n\n### 如何在多 GPU 上运行推理？ \n\u003C\u002Fsummary>\n\nkvpress 通过 [accelerate](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Faccelerate\u002Fen\u002Findex) 支持多 GPU 推理：\n\n```python\npipe = pipeline(\"kv-press-text-generation\", model=model, device_map=\"auto\")\n```\n\n\u003C\u002Fdetails>\n\n\n\u003Cdetails> \u003Csummary> \n\n### 内存和吞吐量的提升是多少？\n\u003C\u002Fsummary>\n\n内存使用量应减少大约 `compression_ratio * kv_cache_size`。例如，当 `compression_ratio=0.5` 时，Llama 3.1-70B 在 100 万 tokens 下约 330GB 的 KV 缓存将降至约 165GB。由于 KV 缓存更小，解码速度也会更快。您可以使用 [此笔记本](notebooks\u002Fspeed_and_memory.ipynb) 测量峰值内存使用量的提升和总时间节省。\n\u003C\u002Fdetails>\n\n\n\u003Cdetails> \u003Csummary> \n\n### 压缩方法是如何工作的？ \u003C\u002Fsummary>\n\n压缩方法会在预填充阶段为每个注意力层注册一个前向钩子（`press.forward_hook` 方法）。注册可以通过将压缩方法作为上下文管理器（`press.__call__` 方法）来应用：\n\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM\nfrom kvpress import KnormPress\n\ndevice = \"cuda:0\"\nckpt = \"meta-llama\u002FMeta-Llama-3.1-8B-Instruct\"\nmodel = AutoModelForCausalLM.from_pretrained(ckpt).to(device)\npress = KnormPress(compression_ratio=0.4)\n\ninputs = model.dummy_inputs[\"input_ids\"].to(device)\n\nwith torch.no_grad():\n    print(model(inputs).past_key_values[0][0].shape)\n    # torch.Size([3, 8, 5, 128])\n    \nwith torch.no_grad(), press(model):\n    print(model(inputs).past_key_values[0][0].shape)\n    # torch.Size([3, 8, 3, 128])\n```\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary> \n\n### 为什么不直接使用 model.generate？ \n\u003C\u002Fsummary>\n\n实际上，您也可以通过将压缩方法作为上下文管理器来配合 `model.generate` 使用：\n\n```python\nwith press(model):\n    outputs = model.generate(inputs)\n```\n\n然而，`generate` 方法无法将问题部分排除在压缩之外，这会人为地偏向像 SnapKV 这样的方法。理想情况下，我们希望有一种无论后续内容如何都能有效工作的压缩方法（例如用于聊天或文档问答等场景）。此外，`generate` 方法也不支持同时生成多个问题的回答。\n\u003C\u002Fdetails>\n\n\n\n\u003Cdetails>\u003Csummary> \n\n### 我可以结合预填充和解码阶段的压缩吗？ \n\u003C\u002Fsummary>\n\n\n可以，使用 `PrefillDecodingPress`，它将预填充和解码阶段各自独立的压缩方法结合起来。\n\n**参数：**\n- `prefilling_press`: 预填充阶段使用的压缩方法\n- `decoding_press`: 解码阶段使用的压缩方法\n\n## 使用示例\n\n### 基本解码压缩\n\n```python\nfrom transformers import pipeline\nfrom kvpress import KnormPress\nfrom kvpress import DecodingPress\n\n# 初始化管道\ndevice = \"cuda:0\"\nmodel = \"meta-llama\u002FLlama-3.1-8B-Instruct\"\nmodel_kwargs = {\"attn_implementation\": \"flash_attention_2\"}\npipe = 
pipeline(\"kv-press-text-generation\", model=model, device=device, model_kwargs=model_kwargs)\n\n# 创建一个每 10 步压缩到 512 个 token 的解码压缩方法\ndecoding_press = DecodingPress(\n    base_press=KnormPress(),\n    compression_steps=10,\n    token_buffer_size=512\n)\n\n# 与管道一起使用\ncontext = \"一段非常长的文本，您希望在生成过程中对其进行压缩\"\nquestion = \"请根据这段上下文讲一个长故事\"\nresponse = pipe(context, question=question, press=decoding_press)[\"answer\"]\n```\n\n### 预填充 + 解码联合压缩\n\n```python\nfrom transformers import pipeline\nfrom kvpress import CriticalKVPress、KnormPress\nfrom kvpress import DecodingPress、PrefillDecodingPress\n\n# 初始化管道\ndevice = \"cuda:0\"\nmodel = \"meta-llama\u002FLlama-3.1-8B-Instruct\"\nmodel_kwargs = {\"attn_implementation\": \"flash_attention_2\"}\npipe = pipeline(\"kv-press-text-generation\", model=model, device=device, model_kwargs=model_kwargs)\n\n# 预填充和解码采用不同的策略\nprefill_press = CriticalKVPress(KnormPress())\ndecoding_press = DecodingPress(\n    base_press=KnormPress(compression_ratio=0.2),\n    compression_steps=5,\n    token_buffer_size=256\n)\n\n# 将两者结合\ncombined_press = PrefillDecodingPress(\n    prefilling_press=prefill_press,\n    decoding_press=decoding_press\n)\n\ncontext = \"一段非常长的上下文，将在预填充阶段被压缩\"\nquestion = \"生成一份详细的分析报告，该报告将在解码阶段被压缩\"\nresponse = pipe(context，question=question，press=combined_press)[\"answer\"]\n```","# kvpress 快速上手指南\n\nkvpress 是一个基于 🤗 transformers 的开源工具库，旨在通过多种压缩方法减少大语言模型（LLM）在长上下文场景下的 KV 缓存显存占用，从而降低部署成本并提升推理效率。\n\n## 环境准备\n\n*   **系统要求**：Linux 或 macOS 环境，推荐配备 NVIDIA GPU 以加速推理。\n*   **Python 版本**：建议 Python 3.9 及以上。\n*   **核心依赖**：\n    *   `transformers` (Hugging Face)\n    *   `torch` (PyTorch)\n    *   `accelerate`\n*   **可选依赖**：若需极致性能，建议安装 `flash-attn` (Flash Attention 2)。\n\n> **国内加速建议**：在安装依赖时，推荐使用清华或阿里镜像源以提升下载速度。\n> ```bash\n> export PIP_INDEX_URL=https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n> ```\n\n## 安装步骤\n\n### 方式一：通过 PyPI 安装（推荐）\n\n最简便的安装方式，适用于大多数用户：\n\n```bash\npip install kvpress\n```\n\n### 方式二：源码安装（含可选依赖）\n\n如果你需要本地开发或使用评估工具及 Flash Attention，建议使用 `uv` 进行源码安装：\n\n```bash\n# 克隆仓库\ngit clone https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress.git\ncd kvpress\n\n# 同步基础环境 (需先安装 uv: pip install uv)\nuv sync\n\n# 同步包含评估工具和 flash-attn 的完整环境\nuv sync --extra eval --extra flash-attn\n```\n\n## 基本使用\n\nkvpress 的核心概念是\"Press\"（压缩器），它可以在预填充（prefilling）阶段压缩 KV 缓存。最简单的方式是使用封装好的 `KVPressTextGenerationPipeline`，它自动处理分词和对话模板。\n\n以下示例演示如何使用 `ExpectedAttentionPress` 将上下文压缩 50%：\n\n```python\nfrom transformers import pipeline\nfrom kvpress import ExpectedAttentionPress\n\n# 1. 初始化管道\n# 管道名称固定为 \"kv-press-text-generation\"\nmodel = \"Qwen\u002FQwen3-8B\"\npipe = pipeline(\"kv-press-text-generation\", model=model, device_map=\"auto\", dtype=\"auto\")\n\n# 2. 定义长上下文和问题\ncontext = \"A very long text you want to compress once and for all\"\nquestion = \"\\nA question about the compressed context\"  # 可选，用于测试不同问题下的压缩效果\n\n# 3. 配置压缩策略\n# compression_ratio=0.5 表示保留 50% 的 KV 对\npress = ExpectedAttentionPress(compression_ratio=0.5)\n\n# 4. 
执行推理\n# 压缩仅应用于 context 部分，question 部分保持完整以确保回答质量\nanswer = pipe(context, question=question, press=press)[\"answer\"]\n\nprint(answer)\n```\n\n### 进阶：解码阶段压缩（实验性功能）\n\n除了预填充阶段，kvpress 还支持在生成（decoding）过程中定期压缩缓存。这通过 `DecodingPress` 包装器实现：\n\n```python\nfrom transformers import pipeline\nfrom kvpress import KnormPress, DecodingPress\n\n# 初始化管道 (建议使用 flash_attention_2)\nmodel = \"meta-llama\u002FLlama-3.1-8B-Instruct\"\npipe = pipeline(\n    \"kv-press-text-generation\", \n    model=model, \n    device=\"cuda:0\", \n    model_kwargs={\"attn_implementation\": \"flash_attention_2\"}\n)\n\n# 创建解码压缩器\n# 每生成 10 个 token 压缩一次，目标是将缓存大小维持在 512 个 token\ndecoding_press = DecodingPress(\n    base_press=KnormPress(),\n    compression_interval=10,\n    target_size=512\n)\n\ncontext = \"A very long text you want to compress during generation\"\nquestion = \"Tell me a long story about this context\"\n\nresponse = pipe(context, question=question, press=decoding_press)[\"answer\"]\n```\n\n> **注意**：并非所有压缩算法都支持解码阶段压缩，目前主要支持继承自 `ScorerPress` 的算法（如 `KnormPress`, `CriticalKVPress` 等）。","某法律科技团队正在构建基于 Llama 3.1-70B 的智能合同审查系统，需要让模型一次性读取并分析长达数十万字的复杂并购协议。\n\n### 没有 kvpress 时\n- **显存成本极高**：处理百万级 token 上下文时，仅 KV 缓存就需要占用高达 330GB 显存，迫使团队租用昂贵的多卡 A100\u002FH100 集群。\n- **推理延迟严重**：随着文档长度增加，注意力机制的计算量线性增长，导致首字生成等待时间过长，无法满足实时交互需求。\n- **部署门槛过高**：巨大的内存需求使得在单卡消费级显卡或边缘设备上部署长文本模型成为不可能，限制了产品落地场景。\n- **开发调试困难**：研究人员尝试自定义压缩算法时，需深入修改底层 Transformer 代码，迭代周期长且容易引入错误。\n\n### 使用 kvpress 后\n- **显存占用骤降**：通过 ExpectedAttentionPress 等策略在预填充阶段压缩缓存，将显存需求降低 50% 以上，单卡即可运行超长上下文任务。\n- **推理速度提升**：大幅减少了注意力计算的关键值对数量，显著缩短首字延迟，使长文档问答变得流畅自然。\n- **部署灵活便捷**：借助 `kv-press-text-generation` 管道，无需改动模型架构即可轻松将长文本能力集成到现有服务中，甚至支持更轻量级的硬件。\n- **算法验证高效**：内置多种压缩基准和评分器，开发者可快速切换不同压缩策略（如 KnormPress）并对比效果，加速算法研发。\n\nkvpress 通过高效的 KV 缓存压缩技术，打破了长上下文大模型的显存与速度瓶颈，让百亿参数模型在低成本硬件上处理百万字文档成为现实。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVIDIA_kvpress_dff52304.jpg","NVIDIA","NVIDIA Corporation","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FNVIDIA_7dcf6000.png","",null,"https:\u002F\u002Fnvidia.com","https:\u002F\u002Fgithub.com\u002FNVIDIA",[80,84,88],{"name":81,"color":82,"percentage":83},"Python","#3572A5",98.2,{"name":85,"color":86,"percentage":87},"Shell","#89e051",1.4,{"name":89,"color":90,"percentage":91},"Makefile","#427819",0.4,1011,125,"2026-04-06T01:52:14","Apache-2.0","Linux, macOS","需要 NVIDIA GPU（示例代码指定 device='cuda:0'），部分功能（如 FlashAttention）需特定硬件支持。显存需求取决于模型大小和上下文长度（原文示例：Llama 3.1-70B 处理 1M tokens 需 330GB，压缩可降低此需求）。","未说明",{"notes":100,"python":98,"dependencies":101},"该工具旨在压缩长上下文大模型的 KV 缓存以节省显存。支持通过 pip 或 uv 安装。核心功能依赖 Hugging Face transformers 库。可选依赖包括用于评估的包和 flash-attn（用于加速注意力机制）。提供多种无需训练的压缩算法（Presses），部分算法实验性支持在解码阶段进行压缩。",[102,103,104,105],"torch","transformers","accelerate","flash-attn (可选)",[35,14],[108,109,110,111,112,113,114,103,115],"llm","inference","kv-cache","kv-cache-compression","long-context","python","pytorch","large-language-models","2026-03-27T02:49:30.150509","2026-04-07T17:04:41.788869",[119,124,129,134,139,143],{"id":120,"question_zh":121,"answer_zh":122,"source_url":123},22478,"在 Google Colab 中运行 kvpress 时遇到 ImportError，提示无法从 transformers 导入 'QuantizedCacheConfig'，该如何解决？","这通常是因为环境配置问题。在 Colab 中，只需运行 `pip install kvpress` 即可自动安装正确版本的依赖。项目维护者提供了一个已验证可用的入门 Notebook（链接：https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1JNvaTKuuAHrl49dYB9-mdEH_y52Ib-NP?usp=drive_link），建议直接使用该 Notebook 或分享您的 Notebook 以便进一步排查。请勿手动安装特定版本的 
transformers，以免版本不兼容。","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fissues\u002F151",{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},22479,"如何为 kvpress 贡献新的压缩方法（如 CriticalKV）？代码实现上有什么建议？","您可以直接在您的分支上开发并随时提交 Pull Request (PR)。关于实现细节，维护者建议优先采用“按头（head-wise）”的计算方式，而不是引入额外的用户参数，这样既简单又能满足大多数场景需求。虽然沿序列长度分割可能也是一种方法，但按头计算在内存效果上与计算 Q、K 或 V 相同，且更易于用户理解。","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fissues\u002F45",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},22480,"kvpress 是否支持针对不同注意力头使用不同压缩率的“特定头（head-specific）”压缩功能？","该功能已通过 Issue #38 得到初步解决。目前的实现方案是使用“近似掩码（approximate masking）”方法来模拟特定头的压缩，这种方法无需编写自定义内核，代码更简洁且符合 Transformers 库的单文件策略。虽然早期开发曾尝试过自定义内核以提高效率，但为了代码可维护性，目前推荐使用基于 PyTorch 操作的缓存管理方法。注意：某些旧方案可能会在 transformers v4.48 版本后被弃用。","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fissues\u002F7",{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},22481,"在压缩缓存时，如何调整位置编码（Positional Embedding）以避免长序列输出乱码？","长序列输出乱码通常是因为缓存压缩未考虑位置编码导致的对齐错误。解决方案是在压缩过程中动态调整余弦（cos）和正弦（sin）位置嵌入。可以通过扩展 `transformers` 库中的 `DynamicCache` 类来实现（例如名为 `FinchCache` 的实现），在压缩时重新旋转（rerotate）key 的位置编码。实验表明，使用经过调整的上下文长度和重旋转 key 的分支可以显著提高模型在长上下文下的输出质量。","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fissues\u002F16",{"id":140,"question_zh":141,"answer_zh":142,"source_url":138},22482,"为什么我的压缩后序列长度（seq length）获取不正确？是否与量化缓存有关？","如果您使用的是量化缓存（Quantized Cache），`get_seq_length` 方法可能会基于 `_seen_tokens` 返回长度，而在应用重旋转（rerotation）后该值可能未被正确设置，从而导致长度错误。建议检查是否使用了标准的 `DynamicCache`，因为它通常能正确返回压缩后的 token 长度。如果必须使用量化缓存，需确保在重旋转操作后手动更新或校验 `_seen_tokens` 状态。",{"id":144,"question_zh":145,"answer_zh":146,"source_url":147},22483,"运行评估脚本得到的结果与 Hugging Face 图表中的性能指标不一致，可能的原因是什么？","结果差异可能源于模型版本、评估配置或基准测试环境的细微差别。虽然具体讨论在提供的片段中被截断，但通常建议：1. 确认使用的模型权重版本是否与官方实验一致（如 Qwen3-8B-Instruct 的具体 revision）；2. 检查 `evaluate.sh` 脚本中的参数设置（如压缩率、基准数据集版本）；3. 
确保运行环境与官方实验环境（如 GPU 类型、CUDA 版本、依赖库版本）保持一致。建议参考项目文档中关于复现实验的具体说明。","https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fissues\u002F147",[149,154,159,164,169,174,179,184,189,194,199,204,209,214,219,224,229,234,239,244],{"id":150,"version":151,"summary_zh":152,"released_at":153},136186,"v0.5.2","- 松弛依赖项 #200，由 @SimJeg 提交\n- 为 HuggingFace 排行榜空间添加每日健康检查，由 @maxjeblick 提交\n- 在退出上下文管理器时重置 DecodingPress 状态 #192，由 @cluster2600 提交\n- 修复评估 README 中的 uv 同步命令 #188，由 @shpark1104 提交\n- 修复 BlockPress 的文档字符串，由 @SimJeg 提交","2026-04-01T15:21:46",{"id":155,"version":156,"summary_zh":157,"released_at":158},136187,"v0.5.1","- 由 @Janghyun1230 添加 `FastKVzipPress`，#183\r\n- 由 @SimJeg 添加 `AGENTS.md` 文件，#185","2026-02-16T11:12:31",{"id":160,"version":161,"summary_zh":162,"released_at":163},136188,"v0.5.0","- 升级到 Transformers 5 版本，并修复与此升级相关的已损坏测试 #180 ","2026-01-28T11:55:52",{"id":165,"version":166,"summary_zh":167,"released_at":168},136189,"v0.4.3","- 对 `DMSPress` 和 `KVzapPress` 的小幅更新 #177\r\n- 加快 CI\u002FCD 流程 #177","2026-01-27T15:53:34",{"id":170,"version":171,"summary_zh":172,"released_at":173},136190,"v0.4.2","- 将 `ThresholdPress` 重命名为 `DMSPress` (#174)","2026-01-21T10:43:47",{"id":175,"version":176,"summary_zh":177,"released_at":178},136191,"v0.4.1","# ✨ 新功能\n\n- `KVzapPress` - KVzip 的快速近似实现，用于预填充和解码阶段的压缩（https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.07891）。附带 KVzap 的训练和评估工具（#171）\n- `ThresholdPress` - 基于得分阈值而非固定压缩比的自适应压缩方法（#171）\n\n# 📈 改进\n- 更新 `KVzipPress`，加入多项改进并支持评估注册表功能（#172）\n- 在评估配置中将 `compress-question` 重命名为 `query-aware`（#168）\n- 重构 `ObservedAttentionPress`，使实现更加清晰（#166）\n- 添加排行榜生成脚本（#171）\n\n# 🐛 错误修复\n- 修复流水线中空上下文的处理问题（#165）","2026-01-14T09:22:50",{"id":180,"version":181,"summary_zh":182,"released_at":183},136192,"v0.4.0","## 🚀 发布 v0.4.0\n\n### ✨ 新特性\n\n- **CURPress** - 基于近似 CUR 分解的、由值引导的 LLM KV 压缩 (#150)\n- **CompactorPress** - Compactor：带有近似杠杆得分的校准式查询无关 KV 缓存压缩 (#143)\n- **解码阶段压缩功能** - 支持在解码阶段进行 KV 缓存压缩 (#139)\n- **AIME25 和 Math500 基准测试** - 针对数学推理任务的新评估数据集 (#142)\n- **`post_init_from_model` 钩子** - 在 BasePress 中添加模型特定的初始化支持 (#163)\n\n### 📈 优化\n\n- 将测试迁移到 GPU，以加快 CI 执行速度 (#132)\n- 提高“大海捞针”测试的覆盖率 (#133)\n- 更新 README 和文档，提升清晰度 (#162)\n- 完善代码库中的文档字符串 (#159)\n- 更新解码笔记本，加入最新示例 (#156)\n- 代码清理：移动工具函数，整理导入语句 (#160)\n\n### 🐛 修复\n\n- 修复 LongBench-v2 基准测试评估问题 (#161)\n- 修复 kvzip 压缩访问 `past_key_values` 的问题\n- 修复 ComposedPress 的行为问题 (#148)\n- 修复导入相关问题 (#144)\n\n### 📦 安装\n\n```bash\npip install kvpress==0.4.0\n```\n\n### 📚 完整变更日志\n\nhttps:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fcompare\u002Fv0.3.0...v0.4.0","2025-12-05T08:54:36",{"id":185,"version":186,"summary_zh":187,"released_at":188},136193,"v0.3.0","## 变更内容\n* 重构：@neuralsorcerer 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F111 中优化了 ExpectedAttentionPress 中的协方差变换。\n* 修复标尺集成测试：@maxjeblick 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F113 中完成。\n* 修复拼写错误：@neuralsorcerer 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F116 中完成。\n* 添加“大海捞针”测试：@alessiodevoto 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F121 中完成。\n* 修复 masked_key_indices：@maxjeblick 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F122 中完成。\n* 添加 copy-pr-bot 设置：@maxjeblick 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F123 中完成。\n* 添加 GitHub Runner：@maxjeblick 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F124 中完成。\n* 修复评估 README.md 命令错误及日志记录错误 #127：@wzp-0815 在 
https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F128 中完成。\n* 添加 GPU Runner：@maxjeblick 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F125 中完成。\n* 升级预期注意力，支持更多模型：@alessiodevoto 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F126 中完成。\n* 添加带统计信息的预期注意力：@alessiodevoto 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F120 中完成。\n* ⚠️ Transformers 兼容性调整：@maxjeblick 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F115 中完成 ---> 这是一项破坏性变更（HF Transformers 中的 KV 缓存机制发生了变化，我们相应地调整了 KVPress）。\n\n## 新贡献者\n* @neuralsorcerer 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F111 中完成了首次贡献。\n* @wzp-0815 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F128 中完成了首次贡献。\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fcompare\u002Fv0.2.10...v0.3.0","2025-09-04T12:47:03",{"id":190,"version":191,"summary_zh":192,"released_at":193},136194,"v0.2.10","## 变更内容\n* 迁移到 uv，由 @alessiodevoto 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F108 中完成\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fcompare\u002Fv0.2.9...v0.2.10","2025-08-06T16:10:16",{"id":195,"version":196,"summary_zh":197,"released_at":198},136195,"v0.2.9","## 变更内容\n* 由 @alessiodevoto 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F96 中重构评估逻辑\n* 由 @alessiodevoto 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F97 中修复 QFilters 和 DuotAttention 在与包装器模型一起使用时的问题\n* 由 @alessiodevoto 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F98 中添加 HuggingFace 排行榜\n* 由 @alessiodevoto 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F101 中修复基准测试目录中的链接\n* 由 @Janghyun1230 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F93 中新增 KVzipPress\n* 由 @alessiodevoto 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F103 中测试按头压缩\n* 由 @giulio98 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F100 中实现仅在预填充阶段运行骨干模型\n* 由 @alessiodevoto 在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F105 中增加与 Transformers 的兼容性及评估功能\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fcompare\u002Fv0.2.8...v0.2.9","2025-07-28T12:39:17",{"id":200,"version":201,"summary_zh":202,"released_at":203},136196,"v0.2.8","## What's Changed\r\n🐛 Bug Fixes\r\n\r\n* Fix failing tests by @maxjeblick in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F94\r\nReverts changes to `CriticalKVPress` performed in #90 that caused the press to initialize incorrectly. 
The PR also fixes some test logic.\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fcompare\u002Fv0.2.7...v0.2.8","2025-07-08T10:21:44",{"id":205,"version":206,"summary_zh":207,"released_at":208},136197,"v0.2.7","**What's Changed**\r\n\r\n🐛 Bug Fixes\r\n- Fix FinchPress for Qwen models family by @alessiodevoto in #82\r\nResolved compatibility issues with Qwen model architecture in FinchPress compression\r\n\r\n✨ New Features\r\n- Add KeyDiffPress and BlockPress by @figuremout in #86\r\nIntroduces new compression methods based on key difference analysis\r\n- Fix for Qwen with Yarn by @giulio98 in #85\r\nEnable Yarn scaling in FinchPress and KeyRerotationPress\r\n\r\n📚 Documentation & Maintenance\r\n- Improve documentation by @maxjeblick in #90 \r\nAdd docstrings to all presses, with their corresponding parameters and paper reference.\r\n- Add @alessiodevoto's to authors by @maxjeblick in #92  🚀\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fcompare\u002Fv0.2.6...v0.2.7","2025-07-07T16:52:20",{"id":210,"version":211,"summary_zh":212,"released_at":213},136198,"v0.2.6","- Improve packaging, #71 by @emmanuel-ferdman, #77 by @fanqiNO1, SDPX headers by @maxjeblick\r\n- Add LagKVPress, #77 by @JoelSeniorLiang \r\n- Support Qwen3 and Gemma3, #81 by @alessiodevoto ","2025-06-16T10:37:04",{"id":215,"version":216,"summary_zh":217,"released_at":218},136199,"v0.2.5","- Add PyramidKVPress, #65  by @figuremout\r\n- Fix style errors, #68 by @maxjeblick\r\n- Add FinchPress, #64 and #69, by @giulio98, @miriam-16, @FaureElia and @SimJeg","2025-04-17T14:03:51",{"id":220,"version":221,"summary_zh":222,"released_at":223},136200,"v0.2.4","- Add `QFilterPress`, #54 by @NathanGodey \r\n- Update copyright dates and add citation file, #60 by @SimJeg \r\n- Add `ChunkKVPress`, #51 by @Dominic789654 \r\n","2025-03-17T12:41:28",{"id":225,"version":226,"summary_zh":227,"released_at":228},136201,"v0.2.3","- Fix distributed inference for the `ExpectedAttentionPress`, #49 by @SimJeg \r\n- Add `DuoAttentionPress`, #50 by @SimJeg","2025-02-18T16:51:44",{"id":230,"version":231,"summary_zh":232,"released_at":233},136202,"v0.2.2","- Fix style check, #48 by @maxjeblick\r\n- Add `CriticalKVPress`, #46 by @FFY0 \r\n- Add epsilon to `ExpectedAttentionPress`, #47 by @SimJeg","2025-02-12T13:39:56",{"id":235,"version":236,"summary_zh":237,"released_at":238},136203,"v0.2.1","- Add `ChunkPress`, #40 by @maxjeblick and @giulio98\r\n- Update README, including new [huggingface space](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fnvidia\u002Fkvpress), #41  and #42 by @SimJeg ","2025-01-21T15:21:38",{"id":240,"version":241,"summary_zh":242,"released_at":243},136204,"v0.2.0","Transformers v4.48 introduced breaking changes handled in this release. The release also features `AdaKVPress`, the first press allowing head-wise compression by patching the attention functions registered in `ALL_ATTENTION_FUNCTIONS` since v4.48. 
When combined with `ExpectedAttentionPress`, `AdaKVPress` achieved the best results observed yet on the RULER benchmark (see [this post](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F38#issuecomment-2580792183)).\r\n\r\n- Add `AdaKVPress`, #38 by @SimJeg and @FFY0\r\n- Handle transformers 4.48, #39 by @SimJeg \r\n- Add InfiniteBench results, #11 by @maxjeblick ","2025-01-13T17:44:24",{"id":245,"version":246,"summary_zh":247,"released_at":248},136205,"v0.1.1","## What's Changed\r\n- https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F33 by @SimJeg fixes a small bug in the pipeline\r\n- https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fpull\u002F36 by @maxjeblick sets transformers \u003C4.48 as a dependency\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fkvpress\u002Fcompare\u002Fv0.1.0...v0.1.1","2025-01-07T10:44:19"]