Batching & Scheduling
Continuous batching, iteration-level scheduling, and how vLLM keeps GPUs saturated.
Assumes: transformer architecture, GPU memory hierarchy, forward pass mechanics
When you send a request to an LLM API, you're not the only one. There are hundreds of requests hitting the same GPU simultaneously. How those requests get scheduled determines whether that GPU is 90% utilized or 20% utilized. Continuous batching is the answer, and it's one of the reasons vLLM changed how everyone serves LLMs. This post covers why static batching leaves GPUs idle, how iteration-level scheduling fixes it, and what happens when memory fills up mid-generation.
The problem with static batching
The naive approach is static batching: collect a fixed number of requests, run them together as a batch, wait for all of them to finish, then take the next batch.
The problem is that sequences finish at different times. A prompt like "What is 2+2?" might finish in 5 tokens. A prompt like "Write me a story about a dragon" might take 500. If you batch them together, the GPU sits idle waiting for the slow sequence to finish before it can pick up new work.
gantt
    title Static Batching - GPU waits for the slowest sequence
    dateFormat X
    axisFormat %s tokens
    section req_1
    Processing :done, 0, 5
    Idle :crit, 5, 500
    section req_2
    Processing :done, 0, 500
    section req_3
    Processing :done, 0, 200
    Idle :crit, 200, 500
GPU utilization tanks. The longer the tail, the worse it gets.
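How badly it tanks is easy to quantify: with static batching, every slot runs until the longest sequence in the batch finishes. A minimal sketch, using the illustrative token counts from the chart above:

```python
def static_batch_utilization(gen_lengths: list[int]) -> float:
    """Fraction of GPU batch slots doing useful work when the whole
    batch must wait for its longest sequence to finish."""
    steps = max(gen_lengths)       # batch runs until the slowest is done
    useful = sum(gen_lengths)      # tokens actually produced
    return useful / (steps * len(gen_lengths))

# the three requests from the chart: 5, 500, and 200 generated tokens
print(static_batch_utilization([5, 500, 200]))  # 0.47
```

Less than half the slot-steps produce a token, and one pathological 500-token request is all it takes.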
Continuous batching
The fix is straightforward: don't wait. As soon as one sequence finishes, swap in a new one from the queue. Every iteration of the generation loop, the scheduler decides what to run next.
This is called iteration-level scheduling: instead of scheduling at the request level, you schedule at every forward pass. The batch composition changes dynamically as sequences finish and new ones enter.
gantt
    title Continuous Batching - gaps fill immediately
    dateFormat X
    axisFormat %s
    section Slot 0
    req_1 :done, 0, 400
    section Slot 1
    req_2 :done, 0, 100
    req_4 :done, 100, 300
    section Slot 2
    req_3 :done, 0, 300
    req_5 :done, 300, 600
GPU stays busy. Throughput goes up dramatically. The original vLLM paper showed up to 23x higher throughput compared to static batching for certain workloads.
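A toy simulation makes the mechanism concrete. This sketch (function name and setup are illustrative, not vLLM's API) tracks a fixed number of batch slots and refills a slot the moment its sequence finishes:

```python
from collections import deque

def simulate_continuous(gen_lengths: list[int], num_slots: int) -> int:
    """Total decode steps when finished sequences are replaced immediately."""
    queue = deque(gen_lengths)
    slots = [queue.popleft() for _ in range(num_slots)]  # remaining tokens per slot
    steps = 0
    while any(r > 0 for r in slots):
        steps += 1
        for i, r in enumerate(slots):
            if r > 0:
                slots[i] = r - 1
                if slots[i] == 0 and queue:
                    slots[i] = queue.popleft()  # swap in the next request
    return steps

# the five requests from the chart above, three slots
print(simulate_continuous([400, 100, 300, 200, 300], 3))  # 600
```

The same five requests under static batching (a batch of three, then a batch of two) would take 400 + 300 = 700 steps, with idle slots throughout.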
Prefill and decode are different problems
Before looking at how the scheduler works, it's worth understanding the two fundamentally different operations happening during inference.
Prefill is processing the input prompt. All tokens are known upfront, so the model can process them in parallel in a single forward pass. This is compute-bound: the GPU is doing a lot of matrix multiplications.
Decode is generating output tokens one at a time. Each step depends on the previous token, so there's no parallelism within a single sequence. This is memory-bound: the GPU spends most of its time loading weights and KV cache from memory.
Prefill - all input tokens processed in one parallel pass

["The", "quick", "brown", "fox"]
   │       │        │       │
   ▼       ▼        ▼       ▼
┌─────────────────────────────┐
│       transformer pass      │   one shot, compute-heavy
└─────────────────────────────┘
               │
               ▼
       first output token
Decode - one token at a time, each depends on the last

step 1: [prompt tokens...]                   → "jumps"
step 2: [prompt tokens... "jumps"]           → "over"
step 3: [prompt tokens... "jumps", "over"]   → "the"
    ⋮
KV cache grows by one row every step, memory-heavy
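Back-of-the-envelope arithmetic shows why decode is memory-bound: every decode step has to stream the entire weight matrix set from GPU memory to emit one token. A sketch with rough assumed figures (7B parameters in fp16, ~2 TB/s of HBM bandwidth on an A100-class GPU; these are ballpark numbers, not measurements):

```python
# rough, assumed hardware/model figures -- adjust for your setup
model_bytes = 7e9 * 2        # 7B parameters in fp16 (2 bytes each)
hbm_bandwidth = 2.0e12       # ~2 TB/s HBM bandwidth (A100-class GPU)

# upper bound on decode speed for ONE sequence:
# each step must re-read all the weights to produce a single token
tokens_per_sec = hbm_bandwidth / model_bytes
print(f"{tokens_per_sec:.0f} tok/s")  # ~143 tok/s, no matter how much compute you have
```

Batching is what recovers the lost compute: 32 sequences in one forward pass share a single read of the weights, so aggregate throughput scales roughly with batch size until compute (or KV cache memory) becomes the limit.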
How the scheduler works
Looking at nano-vllm, a clean ~1,200-line reimplementation of vLLM, the scheduler is surprisingly simple. It maintains two queues: waiting (requests not yet started) and running (requests currently generating).
graph TD
A[new requests] --> B[waiting queue]
B -->|can_allocate?<br/>token budget ok?| C[running queue]
D[preempted seqs] --> C
C -->|can_append?| E[GPU forward]
E --> F{check status}
F -->|seq done?| G[remove from running]
F -->|out of memory?| H[preempt, free blocks]
H --> B
Every step, it runs two phases:
# Simplified from nano-vllm scheduler.py
def schedule(self):
    # Phase 1: promote waiting sequences into running
    num_batched_tokens = 0
    while self.waiting:
        seq = self.waiting[0]
        if not self.block_manager.can_allocate(seq):
            break  # not enough KV cache memory
        if num_batched_tokens + len(seq) > self.max_num_batched_tokens:
            break  # token budget exceeded
        num_batched_tokens += len(seq)
        self.running.append(self.waiting.popleft())

    # Phase 2: check running sequences have space for their next token
    for seq in list(self.running):
        if not self.block_manager.can_append(seq):
            # evict the last-admitted (lowest-priority) sequence if memory is full
            self.preempt(self.running.pop())
KV cache memory is the real constraint
Every token a model processes needs to store its key and value vectors in the KV cache, so future tokens can attend back to it. For a large model and long sequences, this adds up fast. A single A100 80GB can only hold so many sequences in memory at once.
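To put a number on "adds up fast": per token, the cache stores a key and a value vector in every layer, so the footprint is 2 x layers x KV heads x head dim x bytes per value. A sketch with Llama-2-7B-like shapes (assumed here for illustration; check your model's config, and note that grouped-query models have far fewer KV heads):

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # factor of 2 = one key vector + one value vector per layer
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# assumed 7B-class shapes: 32 layers, 32 KV heads, head dim 128, fp16
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
print(per_token / 2**20)         # 0.5 MiB per token
print(2048 * per_token / 2**30)  # 1.0 GiB for one 2048-token sequence
```

At half a megabyte per token, a few dozen long sequences are enough to exhaust whatever memory is left after the weights, which is exactly why the scheduler has to budget KV cache explicitly.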
The block manager in nano-vllm tracks this by dividing memory into fixed-size blocks of 256 tokens. Allocating a sequence means claiming blocks. Freeing a sequence means returning them.
GPU memory divided into KV cache blocks (256 tokens each)

┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│  r1  │  r1  │  r2  │  r2  │  r2  │  r3  │ free │ free │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
 blk_0  blk_1  blk_2  blk_3  blk_4  blk_5  blk_6  blk_7

req_1: 2 blocks (< 512 tokens)
req_2: 3 blocks (< 768 tokens)
req_3: 1 block  (< 256 tokens)
free:  2 blocks remaining
# block_manager.py
from math import ceil

def can_allocate(self, seq) -> bool:
    num_required = ceil(len(seq) / self.block_size)
    return num_required <= len(self.free_block_ids)

def can_append(self, seq) -> bool:
    # a new block is needed only when the current block is exactly full
    return len(seq) % self.block_size != 0 or len(self.free_block_ids) > 0
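A minimal runnable version of the same bookkeeping, with the block size shrunk to 4 for readability (class and method names here are illustrative, not nano-vllm's actual API):

```python
from math import ceil

class ToyBlockManager:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_block_ids = list(range(num_blocks))
        self.allocated: dict[int, list[int]] = {}  # seq id -> claimed block ids

    def can_allocate(self, seq_len: int) -> bool:
        return ceil(seq_len / self.block_size) <= len(self.free_block_ids)

    def allocate(self, seq_id: int, seq_len: int) -> None:
        n = ceil(seq_len / self.block_size)
        self.allocated[seq_id] = [self.free_block_ids.pop() for _ in range(n)]

    def free(self, seq_id: int) -> None:
        # returning blocks is just handing the ids back to the free list
        self.free_block_ids.extend(self.allocated.pop(seq_id))

mgr = ToyBlockManager(num_blocks=8, block_size=4)
mgr.allocate(seq_id=1, seq_len=7)  # ceil(7/4) = 2 blocks claimed
print(mgr.can_allocate(30))        # False: 30 tokens need 8 blocks, only 6 free
mgr.free(1)
print(mgr.can_allocate(30))        # True: all 8 blocks free again
```

Allocation and freeing are O(blocks) list operations; there's no compaction, because blocks are fixed-size and interchangeable.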
Preemption
When memory fills up mid-generation, the scheduler preempts. It takes the lowest-priority running sequence, frees its KV cache blocks, and puts it back in the waiting queue. When memory becomes available again, the sequence gets re-admitted and its prompt gets re-prefilled.
Memory full - preemption kicks in

before:
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│  r1  │  r1  │  r2  │  r2  │  r3  │  r3  │  r3  │  r4  │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
                                    ✗ no room for r4's next token

preempt r4 (lowest priority), free its block:
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│  r1  │  r1  │  r2  │  r2  │  r3  │  r3  │  r3  │ free │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘

r4 goes back to waiting queue, will be re-prefilled when space frees up
Re-prefilling is wasteful but unavoidable without swap space. Some implementations swap the KV cache to CPU memory instead of re-prefilling, a tradeoff between PCIe bandwidth and recompute cost.
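That tradeoff can be roughed out numerically. The sketch below uses assumed ballpark figures (~0.5 MiB of KV per token for a 7B-class fp16 model, ~25 GB/s effective PCIe 4.0 x16 bandwidth, ~10k tokens/s of batch prefill throughput); real numbers vary widely by hardware, and a swapped sequence also pays the transfer cost again on the way back:

```python
# all figures below are rough assumptions, not measurements
kv_bytes_per_token = 0.5 * 2**20  # ~0.5 MiB/token, 7B-class model in fp16
pcie_bw = 25e9                    # ~25 GB/s effective, PCIe 4.0 x16
prefill_tok_per_sec = 10_000      # assumed batch prefill throughput

seq_len = 1024
swap_time = seq_len * kv_bytes_per_token / pcie_bw  # move KV cache to CPU RAM
recompute_time = seq_len / prefill_tok_per_sec      # re-run prefill on the prompt

print(f"swap: {swap_time*1000:.0f} ms, recompute: {recompute_time*1000:.0f} ms")
```

Under these assumptions swapping wins, but the gap narrows as prefill gets faster (shorter prompts, better kernels) or the PCIe link gets contended, which is why both strategies exist in practice.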
What this looks like in practice
A serving system running continuous batching at steady state looks like this: the GPU is always running a forward pass. Some sequences in the batch are on their first token (just prefilled), some are deep into generation. As fast sequences finish, they're replaced immediately. The batch size fluctuates but the GPU never waits.
The throughput gains are real. Continuous batching turns LLM serving from a request-level problem into a token-level scheduling problem, and at the token level, there's almost always work to do.