repo · GitHub · Trust 82 · Primary · Published yesterday · Live · 9h ago
defilantech/LLMKube
Kubernetes operator for self-hosted LLM inference across a heterogeneous GPU fleet: NVIDIA CUDA, AMD Vulkan, and Apple Silicon Metal. Runtimes: llama.cpp, vLLM, TGI, mlx-server. Features: multi-GPU sharding, model caching, OpenAI-compatible endpoints. Licensed under Apache-2.0, run across homelab and on-prem fleets, and actively developed.
Lineage graph
Paper → model → repo connections mined from source citations (Tier-1 exact match).
Covers
news: Understanding dynamic resource allocation in Kubernetes
news: OpenAI and Broadcom announce chip designed for LLM inference at scale
news: OpenAI and Broadcom unveil LLM-optimized inference chip
news: How're you deploying LLMs in production now-a-days? What's the best and most affordable way? [D]
news: Run NVIDIA Nemotron and OpenAI GPT OSS models on Amazon Bedrock in AWS GovCloud (US)
Covers (incoming)
Related across the graph
news: Understanding dynamic resource allocation in Kubernetes
news: OpenAI and Broadcom announce chip designed for LLM inference at scale
news: Run NVIDIA Nemotron and OpenAI GPT OSS models on Amazon Bedrock in AWS GovCloud (US)
news: OpenAI and Broadcom unveil LLM-optimized inference chip
news: Self-hosted GitHub Actions runners on Lambda MicroVMs
news: How're you deploying LLMs in production now-a-days? What's the best and most affordable way? [D]
