paper · arXiv · Trust 82 · Primary · Published 3d ago · Live 2d ago

BlockPilot: Instance-Adaptive Policy Learning for Diffusion-based Speculative Decoding

Speculative decoding accelerates inference by using a lightweight draft model to generate candidate tokens, which are then verified in parallel by the target model, enabling lossless speedups. Recently, diffusion-based speculative decoding has further improved parallelism by generating multiple tokens per forward pass via block-level diffusion, achieving state-of-the-art (SOTA) performance. However, existing methods adopt a fixed inference block size and assume a uniform optimal decoding strategy across all inputs. In this paper, we show that this assumption is suboptimal, as the optimal block s…
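The draft-then-verify loop and the role of the block size can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not BlockPilot's policy and not any library's API: `target_greedy`, `make_draft`, `speculative_decode`, `best_block_size`, the 11-token vocabulary, and the per-round cost `1 + alpha * block_size` are all hypothetical, and only greedy (argmax) verification is shown. The abstract does not specify the acceptance rule, and a real diffusion drafter would emit the whole block in one forward pass rather than one position at a time.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function

VOCAB = 11  # hypothetical toy vocabulary size


def target_greedy(prefix: List[Token]) -> Token:
    """Toy stand-in for the target model's greedy choice (hypothetical)."""
    return (sum(prefix) * 7 + len(prefix)) % VOCAB


def make_draft(error_every: int) -> NextToken:
    """Toy drafter (hypothetical): agrees with the target except at positions
    whose index is a multiple of `error_every`, where it is deliberately wrong."""
    def draft(prefix: List[Token]) -> Token:
        t = target_greedy(prefix)
        return (t + 1) % VOCAB if len(prefix) % error_every == 0 else t
    return draft


def greedy_decode(prompt: List[Token], n_new: int) -> List[Token]:
    """Plain autoregressive decoding with the target: the reference output."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_greedy(out))
    return out[len(prompt):]


def speculative_decode(prompt: List[Token], n_new: int, draft: NextToken,
                       block_size: int) -> Tuple[List[Token], int]:
    """Greedy draft-then-verify loop with a fixed block size.
    Returns the generated tokens and the number of verification rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_new:
        # Draft a block of candidates (one call per position in this toy).
        proposal: List[Token] = []
        for _ in range(block_size):
            proposal.append(draft(out + proposal))
        # Verify the block; in a real system this is one parallel target pass.
        rounds += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target_greedy(out + accepted)
            accepted.append(expected)  # always keep the target's own token
            if tok != expected:
                break  # first mismatch: keep the correction, drop the rest
        else:
            accepted.append(target_greedy(out + accepted))  # bonus token
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n_new], rounds


def best_block_size(prompt: List[Token], n_new: int, draft: NextToken,
                    sizes=range(1, 9), alpha: float = 0.25) -> int:
    """Oracle search under a hypothetical cost model: each round costs one
    target pass plus `alpha` per drafted position. Returns the cheapest size."""
    def cost(b: int) -> float:
        _, rounds = speculative_decode(prompt, n_new, draft, b)
        return rounds * (1 + alpha * b)
    return min(sizes, key=cost)


if __name__ == "__main__":
    prompt = [1, 2, 3]
    for k in (3, 20):
        d = make_draft(error_every=k)
        out, rounds = speculative_decode(prompt, 24, d, block_size=4)
        assert out == greedy_decode(prompt, 24)  # lossless: same tokens
        print(f"error_every={k}: rounds(b=4)={rounds}, "
              f"best block={best_block_size(prompt, 24, d)}")
```

Because every kept token is the target's own greedy choice, the output matches plain target decoding for any block size; only the number of verification rounds changes. In this toy setup, the oracle search favors a smaller block for a drafter that errs every third position than for one that errs every twentieth, which is the kind of per-input variation a single fixed block size cannot capture.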

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Covers

news: [Research] JetSpec: Speculative Decoding with Parallel Tree Drafting Enables up to 9.64x Lossless LLM Inference Speedup with more than 1000TPS
news: DSpark: Speculative decoding accelerates LLM inference [pdf]

Implements (incoming)

repo: sgl-project/SpecForge

Topics

cs.CL