Compression
14 items across the graph are tagged with Compression.
From the graph · 14
Compress tool outputs, logs, files, and RAG chunks before they reach the LLM. 60-95% fewer tokens, same answers. Library, proxy, MCP server.
Unified KV Cache Compression Methods for Auto-Regressive Models
Model compression toolkit engineered for enhanced usability, comprehensiveness, and efficiency.
vMLX - JANGTQ Uber Compressed MLX Models: L2 disk cache (survives restart), L1 paged cache (very fast TTFT), hybrid SSM scheduler, continuous batching, and more.
97% token reduction for AI coding sessions — zero deps, 31 languages, MCP server
Cut your Claude / OpenAI / Gemini bill 70–95% on AI coding. Local proxy that compresses context, keeps provider caches hot, and verifies LLM output ($0 hallucin…
Unlock 2x more Claude Code and Codex usage
Context-Engine MCP - Agentic Context Compression Suite
Persistent memory for LLMs and apps. Content-addressed storage with dedupe, compression, full-text and vector search.
On-device LLM Inference Powered by X-Bit Quantization
ChatCLI is a command-line application that uses language models such as OpenAI, GoogleAI, and others for interactive conversations in the terminal. Supports co…
Official code for the Manning book on structural LLM optimization: depth/width pruning, knowledge distillation, and attention optimization, runnable on free Col…
Native Windows build of vLLM 0.24.0 - no WSL, no Docker. Python 3.13 + CUDA 12.8 + PyTorch 2.11 cu128 for RTX 30/40/50-series, pre-built wheel, Windows patchset…
Unified KV cache compression for LLM inference — TurboQuant, IsoQuant, PlanarQuant, TriAttention. 10 methods, GPU-validated, multi-GPU planner. Compress KV cach…
