WattGPU: Predicting Inference Power and Latency on Unseen GPUs and LLMs
Large Language Model (LLM) inference workloads are a rapidly growing contributor to data center energy consumption. Optimizing these deployments requires matching specific LLMs to the most efficient GPUs, but operators currently lack the tools to do so without exhaustively profiling each combination. While some predictive models exist, they still require profiling data and struggle to generalize to hardware unseen during training. To address this, we introduce WattGPU, featuring two predictive models for mean GPU power draw and Inter-Token Latency (ITL). Our approach leverages only pu
Lineage graph
Paper → model → repo connections mined from source citations (Tier-1 exact match).
Why these links exist
- Linked via arXiv author Mauricio Fadel Argerich → WattGPU: Predicting Inference Power and Latency on Unseen GPUs and LLMs
- Linked via arXiv author Jonathan Fürst → WattGPU: Predicting Inference Power and Latency on Unseen GPUs and LLMs
- Linked via arXiv author Marta Patiño-Martínez → WattGPU: Predicting Inference Power and Latency on Unseen GPUs and LLMs
