paper · arXiv · Trust 82 · Primary · Published 3d ago · Live · 2d ago

RaBitQCache: Rotated Binary Quantization for KVCache in Long Context LLM Inference

Long-context Large Language Model inference is severely bottlenecked by the massive Key-Value (KV) cache, yet existing sparse attention methods often suffer from static fixed-budget (Top-k) retrieval or rely on proxy scores that are computationally expensive and biased. To address these limitations, we propose RaBitQCache, a novel sparse attention framework that utilizes randomized rotated binary quantization and high-throughput binary-INT4 arithmetic to efficiently estimate attention weights. Our proxy score serves as an unbiased estimator with a proven error bound, enabling adaptive Top-p re…
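The abstract contrasts a fixed Top-k budget with an adaptive Top-p one. As a rough illustration of that difference only, here is a minimal sketch of generic Top-p selection over proxy attention scores. It is not the paper's estimator or kernel: the function name `top_p_select`, the threshold `p=0.9`, and the assumption that proxy scores arrive as a plain list of pre-softmax logits are all hypothetical choices made for this example.

```python
import math

def top_p_select(scores, p=0.9):
    """Return indices of the smallest set of keys whose softmax mass reaches p.

    `scores` is assumed to be a list of proxy attention logits, one per cached
    key (a hypothetical input format chosen for this sketch).
    """
    if not scores:
        return []
    # Numerically stable softmax weights.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    # Visit keys from highest to lowest estimated attention weight.
    order = sorted(range(len(scores)), key=lambda i: weights[i], reverse=True)
    selected, mass = [], 0.0
    for i in order:
        selected.append(i)
        mass += weights[i] / total
        if mass >= p:
            break
    return selected

# A peaked score distribution needs few keys; a flat one needs many.
peaked = top_p_select([4.0, 0.0, 0.0, 0.0], p=0.9)
flat = top_p_select([0.0, 0.0, 0.0, 0.0], p=0.9)
print(len(peaked), len(flat))
```

Under these assumptions the number of selected keys adapts to how concentrated the estimated attention is, whereas a Top-k rule would return the same count in both cases.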

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Covers

news: Breakthrough in long-context efficiency announced

Implements (incoming)

repo: thu-pacman/chitu

Related across the graph

repo: thu-pacman/chitu
news: Breakthrough in long-context efficiency announced

Topics

cs.CL