Batching & Scheduling

Continuous batching, iteration-level scheduling, and how vLLM keeps GPUs saturated.

Assumes: transformer architecture, GPU memory hierarchy, forward pass mechanics

When you send a request to an LLM API, you're not the only one. There are hundreds of requests hitting the same GPU simultaneously. How those requests get scheduled determines whether that GPU is 90% utilized or 20% utilized. Continuous batching is the answer — and it's one of the reasons vLLM changed how everyone serves LLMs. This post covers why static batching leaves GPUs idle, how iteration-level scheduling fixes it, and what happens when memory fills up mid-generation.

The problem with static batching

The naive approach is static batching: collect a fixed number of requests, run them together as a batch, wait for all of them to finish, then take the next batch.

The problem is that sequences finish at different times. A prompt like "What is 2+2?" might finish in 5 tokens. A prompt like "Write me a story about a dragon" might take 500. If you batch them together, the GPU sits idle waiting for the slow sequence to finish before it can pick up new work.

gantt
    title Static Batching — GPU waits for the slowest sequence
    dateFormat X
    axisFormat %s tokens
    
    section req_1
    Processing :done, 0, 5
    Idle :crit, 5, 500
    
    section req_2
    Processing :done, 0, 500
    
    section req_3
    Processing :done, 0, 200
    Idle :crit, 200, 500

GPU utilization tanks. The longer the tail, the worse it gets.
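The waste is easy to quantify. For the three requests in the diagram above (token counts are made up for illustration):

```python
# Static batching: three requests needing 5, 500, and 200 decode steps
# share a batch that only retires when the slowest one is done.
lengths = [5, 500, 200]
batch_time = max(lengths)             # batch runs until the slowest finishes
useful = sum(lengths)                 # slot-steps that produced a token
occupied = batch_time * len(lengths)  # slot-steps the batch held the GPU

print(f"utilization: {useful / occupied:.0%}")  # 705 / 1500 -> 47%
```

Less than half the slot-steps do useful work, and a single 2,000-token straggler would push that below 15%.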

Continuous batching

The fix is straightforward: don't wait. As soon as one sequence finishes, swap in a new one from the queue. Every iteration of the generation loop, the scheduler decides what to run next.

This is called iteration-level scheduling — instead of scheduling at the request level, you schedule at every forward pass. The batch composition changes dynamically as sequences finish and new ones enter.
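A minimal sketch of that loop (toy code, not vLLM's actual implementation; `step_fn` and `is_done` stand in for a real model's decode step and EOS check):

```python
from collections import deque

def generate(waiting: deque, max_batch: int, step_fn, is_done):
    """Iteration-level scheduling: rebuild the batch before every forward pass."""
    running = []
    while running or waiting:
        # fill any free slots from the queue -- no waiting for the whole batch
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        step_fn(running)                                  # one forward pass
        running = [s for s in running if not is_done(s)]  # retire finished seqs

# Toy run: each "sequence" is a dict counting remaining tokens
seqs = deque({"left": n} for n in (2, 1, 3))
batch_sizes = []
def fake_step(batch):
    batch_sizes.append(len(batch))
    for s in batch:
        s["left"] -= 1

generate(seqs, max_batch=2, step_fn=fake_step, is_done=lambda s: s["left"] == 0)
print(batch_sizes)  # the batch stays as full as the queue allows
```

The key property: a finished sequence frees its slot at the very next iteration, not at the end of the batch.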

gantt
    title Continuous Batching — gaps fill immediately
    dateFormat X
    axisFormat %s
    
    section Slot 0
    req_1 :done, 0, 400
    
    section Slot 1
    req_2 :done, 0, 100
    req_4 :done, 100, 300
    
    section Slot 2
    req_3 :done, 0, 300
    req_5 :done, 300, 600

GPU stays busy. Throughput goes up dramatically. Anyscale's widely cited continuous-batching benchmark reported up to 23x higher throughput than static batching for certain workloads.


Prefill and decode are different problems

Before looking at how the scheduler works, it's worth understanding the two fundamentally different operations happening during inference.

Prefill is processing the input prompt. All tokens are known upfront, so the model can process them in parallel in a single forward pass. This is compute-bound — the GPU is doing a lot of matrix multiplications.

Decode is generating output tokens one at a time. Each step depends on the previous token, so there's no parallelism within a single sequence. This is memory-bound — the GPU spends most of its time loading weights and KV cache from memory.

Prefill — all input tokens processed in one parallel pass

["The", "quick", "brown", "fox"]
     ↓       ↓       ↓       ↓
 ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
 │     transformer pass      │   one shot, compute-heavy
 ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
             ↓
         first output token


Decode — one token at a time, each depends on the last

step 1:  [prompt tokens...]  →  "jumps"
step 2:  [prompt tokens... "jumps"]  →  "over"
step 3:  [prompt tokens... "jumps", "over"]  →  "the"
             ↑
       KV cache grows by one row every step, memory-heavy
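A back-of-envelope calculation shows why decode is memory-bound. Assume a 7B-parameter fp16 model on an A100 (illustrative round numbers, not measurements): each decode step must stream essentially all the weights from HBM to emit one token per sequence.

```python
# Time to stream weights vs. time to compute, for one decode step
# of a 7B fp16 model on an A100 (rough numbers).
params = 7e9
weight_bytes = params * 2       # fp16: 2 bytes per parameter, ~14 GB
flops = 2 * params              # ~2 FLOPs per parameter per token

hbm_bandwidth = 2.0e12          # A100 HBM: ~2 TB/s
peak_fp16 = 312e12              # A100 tensor cores: ~312 TFLOP/s

t_mem = weight_bytes / hbm_bandwidth  # ~7 ms just moving weights
t_compute = flops / peak_fp16         # ~0.045 ms of actual math
print(f"memory-bound by ~{t_mem / t_compute:.0f}x")
```

Batching amortizes the weight reads: the same ~14 GB stream serves every sequence in the batch, which is exactly why keeping the decode batch full matters so much.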

How the scheduler works

Looking at nano-vllm — a clean ~1,200-line reimplementation of vLLM — the scheduler is surprisingly simple. It maintains two queues: waiting (requests not yet started) and running (requests currently generating).

graph TD
    A[new requests] --> B[waiting queue]
    B -->|can_allocate?<br/>token budget ok?| C[running queue]
    D[preempted seqs] --> C
    C -->|can_append?| E[GPU forward]
    E --> F{check status}
    F -->|seq done?| G[remove from running]
    F -->|out of memory?| H[preempt, free blocks]
    H --> B

Every step, it runs two phases:

# Simplified from nano-vllm scheduler.py

def schedule(self):
    # Phase 1: promote waiting sequences into running (FCFS)
    num_batched_tokens = 0
    while self.waiting:
        seq = self.waiting[0]
        if not self.block_manager.can_allocate(seq):
            break  # not enough KV cache memory
        if num_batched_tokens + len(seq) > self.max_num_batched_tokens:
            break  # per-step token budget exceeded
        num_batched_tokens += len(seq)
        self.block_manager.allocate(seq)
        self.running.append(self.waiting.popleft())

    # Phase 2: every running sequence needs room for its next token;
    # if memory is full, evict the most recent arrival
    for seq in list(self.running):
        if not self.block_manager.can_append(seq):
            self.preempt(self.running.pop())  # frees blocks, requeues seq
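Phase 1's admission check is easy to play with in isolation. A runnable toy (hypothetical names, not nano-vllm's actual API):

```python
from collections import deque
from math import ceil

class ToyBlockManager:
    """Stand-in for the real block manager: just counts free 256-token blocks."""
    def __init__(self, num_blocks, block_size=256):
        self.free = num_blocks
        self.block_size = block_size
    def can_allocate(self, seq_len):
        return ceil(seq_len / self.block_size) <= self.free
    def allocate(self, seq_len):
        self.free -= ceil(seq_len / self.block_size)

bm = ToyBlockManager(num_blocks=4)
waiting = deque([300, 600, 200])   # prompt lengths in tokens
running = []

# Phase 1: admit in arrival order, stop at the first request that doesn't fit
while waiting and bm.can_allocate(waiting[0]):
    seq_len = waiting.popleft()
    bm.allocate(seq_len)
    running.append(seq_len)

print(running, list(waiting))  # [300] [600, 200]
```

Note the head-of-line blocking: the 200-token request would fit in the two remaining blocks, but FCFS admission stops at the 600-token request in front of it.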

KV cache memory is the real constraint

Every token a model processes needs to store its key and value vectors in the KV cache, so future tokens can attend back to it. For a large model and long sequences, this adds up fast. A single A100 80GB can only hold so many sequences in memory at once.
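How fast it adds up is easy to compute: each token stores K and V vectors in every layer, so per-token bytes = 2 Ɨ layers Ɨ KV heads Ɨ head dim Ɨ bytes per element. A sketch for a Llama-2-7B-shaped model (32 layers, 32 KV heads, head dim 128, fp16):

```python
# KV cache footprint for a Llama-2-7B-shaped model in fp16.
layers, kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2
per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
print(f"{per_token // 1024} KiB per token")                 # 512 KiB

# With ~60 GB left on an A100 80GB after the ~14 GB of weights:
budget = 60 * 1024**3
print(f"{budget // per_token:,} tokens of KV cache total")  # ~123k tokens
```

At a 2K context, that's only about 60 concurrent sequences — which is why the KV cache, not compute, usually caps the batch size.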

The block manager in nano-vllm tracks this by dividing memory into fixed-size blocks of 256 tokens. Allocating a sequence means claiming blocks. Freeing a sequence means returning them.

GPU memory divided into KV cache blocks (256 tokens each)

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”
│  r1  │  r1  │  r2  │  r2  │  r2  │  r3  │ free │ free │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
 blk_0  blk_1  blk_2  blk_3  blk_4  blk_5  blk_6  blk_7

req_1: 2 blocks  (< 512 tokens)
req_2: 3 blocks  (< 768 tokens)
req_3: 1 block   (< 256 tokens)
free:  2 blocks remaining

# block_manager.py
from math import ceil

def can_allocate(self, seq) -> bool:
    num_required = ceil(len(seq) / self.block_size)
    return num_required <= len(self.free_block_ids)

def can_append(self, seq) -> bool:
    # needs a new block if current block is full
    return len(seq) % self.block_size != 0 or len(self.free_block_ids) > 0
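The `can_append` boundary is worth spelling out: a fresh block is needed exactly when the sequence length is a multiple of the block size, i.e. the current block just filled up. A quick check (toy code):

```python
BLOCK_SIZE = 256

def needs_new_block(seq_len):
    # the next token needs a fresh block iff the last block is exactly full
    return seq_len % BLOCK_SIZE == 0

for n in (255, 256, 257, 512):
    print(n, needs_new_block(n))
```

So a sequence at exactly 256 or 512 tokens can stall if no free block exists, while one at 257 tokens still has 255 slots of headroom.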

Preemption

When memory fills up mid-generation, the scheduler preempts. It takes the lowest-priority running sequence, frees its KV cache blocks, and puts it back in the waiting queue. When memory becomes available again, the sequence gets re-admitted and its prompt gets re-prefilled.

Memory full — preemption kicks in

before:
ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”
│  r1  │  r1  │  r2  │  r2  │  r3  │  r3  │  r3  │  r4  │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
                                                    ↑ no room for r4's next token

preempt r4 (lowest priority), free its block:
ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”
│  r1  │  r1  │  r2  │  r2  │  r3  │  r3  │  r3  │ free │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

r4 goes back to waiting queue, will be re-prefilled when space frees up

Re-prefilling is wasteful but unavoidable without swap space. Some implementations swap the KV cache to CPU memory instead of re-prefilling — a tradeoff between PCIe bandwidth and recompute cost.
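Those two costs can be estimated. Using the rough ~0.5 MiB/token KV size of a 7B fp16 model and illustrative hardware numbers — a sketch, not a benchmark:

```python
# Swap-to-CPU vs. re-prefill for a preempted 1,024-token sequence.
# All numbers are illustrative order-of-magnitude estimates.
tokens = 1024
kv_per_token = 512 * 1024            # ~0.5 MiB/token (7B fp16 model)
pcie_bandwidth = 25e9                # PCIe 4.0 x16: ~25 GB/s

swap_bytes = tokens * kv_per_token
swap_time = 2 * swap_bytes / pcie_bandwidth   # out to CPU now, back in later

prefill_tps = 10_000                 # assumed compute-bound prefill rate
recompute_time = tokens / prefill_tps

print(f"swap ~{swap_time*1e3:.0f} ms vs re-prefill ~{recompute_time*1e3:.0f} ms")
```

Under these made-up numbers swapping wins, but the crossover depends on PCIe generation, model size, and prompt length — very long prompts make recompute relatively more expensive, since prefill attention cost grows superlinearly while swap cost grows linearly.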

What this looks like in practice

A serving system running continuous batching at steady state looks like this: the GPU is always running a forward pass. Some sequences in the batch are on their first token (just prefilled), some are deep into generation. As fast sequences finish, they're replaced immediately. The batch size fluctuates but the GPU never waits.

The throughput gains are real. Continuous batching turns LLM serving from a request-level problem into a token-level scheduling problem — and at the token level, there's almost always work to do.

Further reading