COLING 2025 Tutorial:
Speculative Decoding for Efficient LLM Inference

1The Hong Kong Polytechnic University, 2SEA AI Lab, 3TikTok

Sunday, January 19, 09:00 - 12:30 (GST), Tutorial 1
Abu Dhabi National Exhibition Centre, Capital Suite 7

About this tutorial

This tutorial presents a comprehensive introduction to Speculative Decoding (SD), a technique for accelerating LLM inference that has attracted significant research interest in recent years. SD is a decoding paradigm designed to mitigate the high inference latency caused by autoregressive decoding in LLMs. At each decoding step, SD efficiently drafts several future tokens and then verifies them in parallel with the target LLM. Unlike traditional autoregressive decoding, which emits one token per step, this approach decodes multiple tokens per step, achieving 2x-4x speedups in LLM inference while preserving the model's original output distribution.
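The draft-then-verify loop described above can be sketched in a few lines. The snippet below is a minimal, illustrative implementation of the greedy variant: all names (`speculative_decode`, `target_next`, `draft_next`, `gamma`) are ours, and the toy "models" are simple callables standing in for real LLM forward passes. In practice, step 2 is a single batched forward pass of the target model, not a Python loop.

```python
def speculative_decode(target_next, draft_next, prefix, gamma, max_new):
    """Greedy speculative decoding (illustrative sketch).

    target_next / draft_next: functions mapping a token sequence to that
    model's greedy next token. `gamma` is the number of draft tokens
    proposed per step.
    """
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        # 1. Draft: the small model proposes `gamma` tokens autoregressively.
        draft = []
        for _ in range(gamma):
            draft.append(draft_next(out + draft))
        # 2. Verify: accept the longest prefix of the draft that matches the
        #    target model's own greedy choices (one parallel forward pass in
        #    a real implementation; simulated sequentially here).
        accepted = []
        for i, tok in enumerate(draft):
            if target_next(out + draft[:i]) == tok:
                accepted.append(tok)
            else:
                break
        out += accepted
        # 3. Whether or not a draft token was rejected, the target model's
        #    prediction at the first unverified position yields one more
        #    guaranteed-correct token, so every step makes progress.
        out.append(target_next(out))
    return out[:len(prefix) + max_new]
```

Because every accepted token is checked against the target model, the output is identical to plain greedy decoding with the target model alone; a weak draft model only reduces the speedup, never the quality.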

This tutorial delves into the latest techniques in SD, including draft model architectures and verification strategies. Additionally, it explores the acceleration potential and future research directions in this promising field. We aim for this tutorial to elucidate the current research landscape and offer insights for researchers interested in Speculative Decoding, ultimately contributing to more efficient LLM inference.

Schedule

Our tutorial will be held on January 19 (all times are in GST, Abu Dhabi local time).

Time         Section                                               Presenter
09:00-09:40  Part I: Introduction & Definition                     Heming
09:40-10:25  Part II: History and A Taxonomy of Methods            Qian
10:25-10:30  Q & A Session I
10:30-11:00  Coffee break
11:00-11:40  Part III: Cutting-edge Algorithms                     Heming
11:40-12:10  Part IV: Downstream Adaptations                       Yongqi
12:10-12:30  Part V: Final Remarks and Outlook + Q & A Session II  Yongqi

Reading List

Bold papers are discussed in detail during our tutorial.

For further information, we recommend referring to our Survey and Reading List on Speculative Decoding.



BibTeX

@article{speculative-decoding-tutorial,
  author  = {Xia, Heming and Du, Cunxiao and Li, Yongqi and Liu, Qian and Li, Wenjie},
  title   = {COLING 2025 Tutorial: Speculative Decoding for Efficient LLM Inference},
  journal = {COLING 2025},
  year    = {2025},
}