Paper · arXiv · Trust 82 · Primary · Published yesterday · Live · 2h ago

WattGPU: Predicting Inference Power and Latency on Unseen GPUs and LLMs

Large Language Model (LLM) inference workloads are a rapidly growing contributor to data center energy consumption. Optimizing these deployments requires matching specific LLMs to the most efficient GPUs, but operators currently lack the tools to do so without exhaustively profiling each combination. While some predictive models exist, they still require profiling data and struggle to generalize to hardware unseen during training. To address this, we introduce WattGPU, featuring two predictive models for mean GPU power draw and Inter-Token Latency (ITL). Our approach leverages only pu
