paper · arXiv · Trust 82 · Primary · Published yesterday · Live · 18h ago

An Efficient vLLM-Based Inference Pipeline for Unified Audio Understanding and Generation

While Large Multimodal Models excel in comprehension, high-throughput inference engines lack native support for multimodal generation. The gap is most severe for Speech Language Models, where generating multi-layered audio tokens, either via decoupled autoregressive plus non-autoregressive (AR+NAR) decoding or via synchronous Multi-Token Prediction (MTP) with delay-pattern interleaving, conflicts with the standard single-stream decoding loop. We present a vLLM-based inference pipeline for unified speech understanding and generation. We extend autoregressive decoding to natively execute delay-pattern de-interleaving and coordinated multi-stream sampling, integrating an on-GPU
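The delay-pattern interleaving mentioned in the abstract can be illustrated with a small, generic sketch. It is not the paper's implementation: it assumes the common convention in which codebook k is shifted right by k steps, and the function names and the `PAD` value are hypothetical, chosen only for illustration.

```python
PAD = -1  # hypothetical padding id, assumed for this illustration only


def apply_delay_pattern(codes, pad=PAD):
    """Shift codebook k right by k steps.

    codes: list of K token lists, each of length T (one list per codebook).
    Returns K lists of length T + K - 1, padded where no token is available.
    """
    num_books = len(codes)
    delayed = []
    for k, stream in enumerate(codes):
        # k pads in front, (K - 1 - k) pads behind, so all rows share one length.
        row = [pad] * k + list(stream) + [pad] * (num_books - 1 - k)
        delayed.append(row)
    return delayed


def undo_delay_pattern(delayed):
    """De-interleave: drop the padding and realign every codebook per frame."""
    num_books = len(delayed)
    num_frames = len(delayed[0]) - (num_books - 1)
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


# Three codebooks, four audio frames.
codes = [
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
]
delayed = apply_delay_pattern(codes)
restored = undo_delay_pattern(delayed)

for row in delayed:
    print(row)
```

Each column of `delayed` is what a multi-stream decoder would sample in one step, which is why a single-stream decoding loop does not fit this layout directly; `undo_delay_pattern` recovers the frame-aligned codes that an audio decoder would consume.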

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

Linked via arXiv author Haoran Wang →
An Efficient vLLM-Based Inference Pipeline for Unified Audio Understanding and Generation
Linked via arXiv author Jinchuan Tian →
An Efficient vLLM-Based Inference Pipeline for Unified Audio Understanding and Generation
Linked via arXiv author Siddhant Arora →
An Efficient vLLM-Based Inference Pipeline for Unified Audio Understanding and Generation
Linked via arXiv author Shinji Watanabe →
An Efficient vLLM-Based Inference Pipeline for Unified Audio Understanding and Generation

Implements

repo thu-pacman/chitu · repo vllm-project/vllm · repo jaswon/osu-dreamer · repo sgl-project/sglang

Has model

model Whisper-Lite

authored (incoming)

person Haoran Wang · person Jinchuan Tian · person Siddhant Arora · person Shinji Watanabe

Related across the graph

model Whisper-Lite · repo thu-pacman/chitu · person Siddhant Arora · repo vllm-project/vllm · repo sgl-project/sglang · person Jinchuan Tian · person Haoran Wang · repo jaswon/osu-dreamer · person Shinji Watanabe

Topics

cs.AI