Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning
Instruction tuning for speech language models (SLMs) is substantially more challenging than for text-based large language models (LLMs), as it requires learning a new modality and a wide range of speech-specific instructions in addition to those supported by text LLMs. Existing SLM training approaches largely replicate the text LLM training paradigm by synthesizing large-scale speech pre-training and instruction-tuning datasets. However, this strategy is difficult to scale, since speech sequences are significantly longer than text sequences. In this paper, we propose SpeechCombine, an instruct
Lineage graph
Paper → model → repo connections mined from source citations (Tier-1 exact match).
Why these links exist
- Linked via arXiv author Congrui Du → Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning
- Linked via arXiv author Yang Zhang → Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning
- Linked via arXiv author Kaizhi Qian → Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning
- Linked via arXiv author Shiyu Chang → Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning
