Paper · arXiv · Trust 82 · Primary · Published yesterday · Live · 19h ago

Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning

Instruction tuning for speech language models (SLMs) is substantially more challenging than for text-based large language models (LLMs), as it requires learning a new modality and a wide range of speech-specific instructions in addition to those supported by text LLMs. Existing SLM training approaches largely replicate the text LLM training paradigm by synthesizing large-scale speech pre-training and instruction-tuning datasets. However, this strategy is difficult to scale, since speech sequences are significantly longer than text sequences. In this paper, we propose SpeechCombine, an instruct

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

Linked via arXiv author Congrui Du → Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning
Linked via arXiv author Yang Zhang → Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning
Linked via arXiv author Kaizhi Qian → Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning
Linked via arXiv author Shiyu Chang → Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning