Paper · arXiv · Trust 82 · Primary · Published yesterday · Live · 19h ago

Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning

Instruction tuning for speech language models (SLMs) is substantially more challenging than for text-based large language models (LLMs), as it requires learning a new modality and a wide range of speech-specific instructions in addition to those supported by text LLMs. Existing SLM training approaches largely replicate the text LLM training paradigm by synthesizing large-scale speech pre-training and instruction-tuning datasets. However, this strategy is difficult to scale, since speech sequences are significantly longer than text sequences. In this paper, we propose SpeechCombine, an instruct

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

  • Linked via arXiv author: Congrui Du

  • Linked via arXiv author: Yang Zhang

  • Linked via arXiv author: Kaizhi Qian

  • Linked via arXiv author: Shiyu Chang


Related across the graph

Topics