Cac
19 items across the graph — tagged with Cac.
From the graph · 19
LMCache: Supercharge Your LLM with the Fastest KV Cache Layer
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
The most accurate document search and store for building AI apps
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)
Unified KV Cache Compression Methods for Auto-Regressive Models
[Neurips 2025] R-KV: Redundancy-aware KV Cache Compression for Reasoning Models
vMLX - JANGTQ Uber Compressed MLX Models - L2 Disk Cache (survives restart) + L1 Paged (super fast ttft) + Hybrid SSM Scheduler + Cont Batching + etc!
Pure Rust + CUDA LLM inference engine — no PyTorch, OpenAI-compatible, serves Qwen3 to Kimi-K2
Redis Vector Library (RedisVL) -- the AI-native Python client for Redis.
Self-hosted AI agent OS. Your memory, chat, agents, and files stay on hardware you own, offline by default, cloud by choice. Offline AI memory (taOSmd), self-ho…
Persist and reuse KV Cache to speedup your LLM.
Suite of tools containing an in-memory vector datastore and AI proxy
Anthropic Claude API wrapper for Go
High-performance KV cache storage for LLM inference — GPU offloading, SSD caching, and cross-node sharing via RDMA. Works with vLLM and SGLang.
Extreme weight + KV cache compression for LLMs on Apple Silicon (MLX implementation of Google's TurboQuant)
linked of Romanian fiscal authority(MFP-ANAF-GOV-RO)
RAMen is a fast in-memory data store like Redis, but built for AI: drop-in Redis protocol, native vector search, semantic caching, and a built-in MCP server for…
Native Windows build of vLLM 0.24.0 - no WSL, no Docker. Python 3.13 + CUDA 12.8 + PyTorch 2.11 cu128 for RTX 30/40/50-series, pre-built wheel, Windows patchset…
Unified KV cache compression for LLM inference — TurboQuant, IsoQuant, PlanarQuant, TriAttention. 10 methods, GPU-validated, multi-GPU planner. Compress KV cach…
