This tutorial presents a comprehensive introduction to Speculative Decoding (SD), an advanced technique for LLM inference acceleration that has garnered significant research interest in recent years. SD is a decoding paradigm designed to mitigate the high inference latency of autoregressive decoding in LLMs. At each decoding step, SD first drafts several future tokens cheaply and then verifies them against the target model in parallel. Unlike traditional autoregressive decoding, this allows multiple tokens to be accepted per step, achieving 2x-4x speedups in LLM inference while leaving the target model's output distribution unchanged.
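Since the core draft-then-verify loop is compact, the sketch below illustrates a single SD step with the standard rejection-sampling verification (Leviathan et al., 2023; Chen et al., 2023). It is a minimal illustration, not a reference implementation from the tutorial: the callables `target_lm` and `draft_lm`, the draft length `gamma`, and the function name are all placeholders, assuming any pair of causal LMs that map token ids of shape `(1, seq_len)` to logits of shape `(1, seq_len, vocab_size)`.

```python
import torch

@torch.no_grad()
def speculative_step(target_lm, draft_lm, input_ids, gamma=4):
    """One draft-then-verify step: draft `gamma` tokens with the small model,
    verify them with a single target-model forward pass, and return the
    accepted continuation (at least one new token)."""
    ids = input_ids
    draft_probs = []
    # 1) Autoregressively draft `gamma` candidate tokens with the cheap model.
    for _ in range(gamma):
        q = torch.softmax(draft_lm(ids)[:, -1, :], dim=-1)  # (1, V)
        tok = torch.multinomial(q, num_samples=1)           # sample a draft token
        draft_probs.append(q)
        ids = torch.cat([ids, tok], dim=-1)
    # 2) Score the prefix plus all drafted tokens in ONE target forward pass.
    p_all = torch.softmax(target_lm(ids), dim=-1)           # (1, L + gamma, V)
    n_prefix = input_ids.shape[1]
    accepted = []
    for i in range(gamma):
        tok = ids[0, n_prefix + i]
        p = p_all[:, n_prefix + i - 1, :]  # target distribution at this position
        q = draft_probs[i]
        # 3) Accept the draft with prob min(1, p/q); this acceptance rule
        #    provably preserves the target model's output distribution.
        if torch.rand(()) < (p[0, tok] / q[0, tok]).clamp(max=1.0):
            accepted.append(tok.item())
        else:
            # 4) On rejection, resample from the residual max(p - q, 0).
            residual = (p - q).clamp(min=0)
            residual = residual / residual.sum()
            accepted.append(torch.multinomial(residual, 1).item())
            break
    else:
        # All gamma drafts accepted: sample one bonus token from the target.
        accepted.append(torch.multinomial(p_all[:, -1, :], 1).item())
    return accepted
```

Because each target-model forward pass now validates several drafted tokens at once, the expected number of tokens generated per pass exceeds one; this is the source of the reported 2x-4x wall-clock speedups.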
This tutorial delves into the latest techniques in SD, including draft-model architectures and verification strategies. It also explores the acceleration potential and future research directions of this promising field. We hope this tutorial elucidates the current research landscape and offers insights to researchers interested in Speculative Decoding, ultimately contributing to more efficient LLM inference.
Our tutorial will be held on January 19, 2025. All times are in Gulf Standard Time (GST), i.e., Abu Dhabi local time.
| Time | Section | Presenter |
|---|---|---|
| 09:00–09:40 | Part I: Introduction & Definition | Heming |
| 09:40–10:25 | Part II: History and A Taxonomy of Methods | Qian |
| 10:25–10:30 | Q&A Session I | |
| 10:30–11:00 | Coffee break | |
| 11:00–11:40 | Part III: Cutting-edge Algorithms | Heming |
| 11:40–12:10 | Part IV: Downstream Adaptations | Yongqi |
| 12:10–12:30 | Part V: Final Remarks and Outlook + Q&A Session II | Yongqi |
For further information, we recommend referring to our Survey and Reading List on Speculative Decoding; papers marked in bold there are discussed in detail during our tutorial.
@article{speculative-decoding-tutorial,
  author  = {Xia, Heming and Du, Cunxiao and Li, Yongqi and Liu, Qian and Li, Wenjie},
  title   = {COLING 2025 Tutorial: Speculative Decoding for Efficient LLM Inference},
  journal = {COLING 2025},
  year    = {2025},
}