Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction
Abstract
A novel gating-based key-value cache eviction method for frozen-weight large language models achieves high compression ratios with minimal computational overhead while maintaining near-lossless performance across diverse tasks.
Efficient key-value (KV) cache management is crucial for the practical deployment of large language models (LLMs), yet existing compression techniques typically trade task performance against computational overhead. We propose a novel gating-based KV cache eviction method for frozen-weight LLMs that achieves high compression ratios at negligible computational cost. Our approach introduces lightweight sink-attention gating modules that identify and retain critical KV pairs, and it integrates seamlessly into both the prefill and decoding stages. The proposed gate-training algorithm relies only on forward passes of the LLM, avoiding expensive backpropagation, while achieving strong task generalization through a task-agnostic reconstruction objective. Extensive experiments across the Qwen2.5-1M, Qwen3, and Gemma3 model families show that our method maintains near-lossless performance while evicting up to 70% of the KV cache. The results are consistent across a wide range of tasks, including long-context understanding, code comprehension, and mathematical reasoning, demonstrating the generality of our approach.
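The abstract describes the method only at a high level; as a rough illustration, the sketch below shows one way a lightweight gate could score cached key-value pairs and evict the lowest-scoring fraction while always retaining attention-sink and most recent tokens. The `KVGate` module, the sigmoid scorer over concatenated keys and values, the `keep_ratio` value, and the sink-token handling are all assumptions made for this example, not the paper's actual implementation, and it covers only inference-time eviction rather than the forward-pass-only gate training.

```python
# Minimal sketch (not the paper's code) of gating-based KV cache eviction:
# a small learned gate scores each cached KV pair and the lowest-scoring
# pairs are dropped. Shapes, module names, and thresholds are assumptions.
import torch
import torch.nn as nn


class KVGate(nn.Module):
    """Per-head scorer assigning an importance score to each cached KV pair."""

    def __init__(self, head_dim: int):
        super().__init__()
        self.score = nn.Linear(2 * head_dim, 1)  # scores concat(key, value)

    def forward(self, keys: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
        # keys, values: [batch, heads, seq_len, head_dim]
        gate_in = torch.cat([keys, values], dim=-1)
        return torch.sigmoid(self.score(gate_in)).squeeze(-1)  # [batch, heads, seq_len]


def evict_kv(keys, values, gate: KVGate, keep_ratio: float = 0.3, num_sink: int = 4):
    """Keep the top `keep_ratio` fraction of KV pairs per head, always retaining
    the first `num_sink` (attention-sink) positions and the newest token."""
    scores = gate(keys, values)                  # [B, H, S]
    scores[..., :num_sink] = float("inf")        # never evict sink tokens
    scores[..., -1] = float("inf")               # never evict the latest token
    seq_len = scores.shape[-1]
    k = max(num_sink + 1, int(keep_ratio * seq_len))
    keep_idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values       # [B, H, k]
    idx = keep_idx.unsqueeze(-1).expand(-1, -1, -1, keys.shape[-1])     # [B, H, k, D]
    return keys.gather(2, idx), values.gather(2, idx)


if __name__ == "__main__":
    B, H, S, D = 1, 8, 1024, 64
    keys, values = torch.randn(B, H, S, D), torch.randn(B, H, S, D)
    k_kept, v_kept = evict_kv(keys, values, KVGate(D), keep_ratio=0.3)
    print(k_kept.shape)  # torch.Size([1, 8, 307, 64]) -> ~70% of the cache evicted
```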
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs (2025)
- Learning What to Write: Write-Gated KV for Efficient Long-Context Inference (2025)
- KVzap: Fast, Adaptive, and Faithful KV Cache Pruning (2026)
- MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning (2025)
- Hold Onto That Thought: Assessing KV Cache Compression On Reasoning (2025)
- HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference (2026)
- KV Pareto: Systems-Level Optimization of KV Cache and Model Compression for Long Context Inference (2025)
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend