PagedAttention Deep Dive - Nagging Questions - Vol. 1

September 18, 2025
8 min read

My journey down the rabbit hole of LLM inference optimization started with a LinkedIn post by Anshuman Mishra. It neatly summarized the "what" and "why" of PagedAttention: the key bottleneck in LLM serving is KV cache memory management, and PagedAttention tackles it by eliminating fragmentation and enabling larger batches and cache sharing, leading to a 2-4x throughput improvement.

The post was a perfect high-level overview and an equally good starting point for digging into the lower-level details. In particular: how does PagedAttention map onto the GPU's SIMD architecture?


Q1: How does the GPU track these non-contiguous blocks? Is it a linked list?

My first thought was that if the KV cache blocks are non-contiguous, they must be tracked like a linked list. That would be fine for a single thread, but it becomes a major bottleneck on a GPU, where thousands of threads are scheduled at a time.

The In-Depth Answer: A traditional linked list is catastrophic for GPU performance due to pointer chasing. Each step of traversing the list is a dependent memory read, a fundamentally sequential operation that would stall thousands of parallel threads.

Instead, PagedAttention uses a block table. This is a small lookup tensor (an array) maintained for each sequence.

  • Structure: For a sequence, block_table[i] contains the unique ID or base address of the physical block that corresponds to the i-th logical block. This block table is passed as an argument to the CUDA kernel. When a GPU thread needs the K/V data for a specific logical token position, it mathematically calculates its logical block index and the offset within that block. It then performs a single, independent memory read from the block table to find the physical block's base address.
  • How does Lookup work: The address calculation is immediate and parallelizable. There is no sequential traversal. This mechanism provides the logical continuity of a linked list with the random-access performance of an array, which is ideal for the GPU:

    physical_address = base_address_from_block_table + token_offset
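
To make this concrete, here is a minimal sketch of the lookup in PyTorch, assuming a block size of 16 tokens and a flat cache pool indexed by physical block ID (the names are illustrative, not vLLM's actual internals):

    import torch

    BLOCK_SIZE = 16  # tokens per physical block (vLLM commonly defaults to 16)

    def gather_kv_for_token(kv_pool: torch.Tensor,
                            block_table: torch.Tensor,
                            token_pos: int) -> torch.Tensor:
        """Fetch the cached K (or V) vector for one logical token position.

        kv_pool:     [num_physical_blocks, BLOCK_SIZE, head_dim] physical cache pool
        block_table: [num_logical_blocks] physical block ID for each logical block
        """
        logical_block = token_pos // BLOCK_SIZE       # which logical block the token lives in
        offset = token_pos % BLOCK_SIZE               # position inside that block
        physical_block = block_table[logical_block]   # one independent read of the block table
        return kv_pool[physical_block, offset]        # random access into the pool

    # Example: with BLOCK_SIZE=16, token 37 lives at offset 5 of logical block 2,
    # and block_table[2] says which physical block that maps to.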

Q2: So the kernel just fetches all the K,V data at once?

Understanding the block table, my next thought was about the workflow. How does the kernel use this information?

The In-Depth Answer: The kernel executes a seamless, just-in-time "lookup, load, and compute" cycle for every element of the attention calculation. The primary inputs to the attention kernel are:

  1. The Q matrix.
  2. A tensor containing the block tables for every sequence in the batch.
  3. A pointer to the base of the entire physical KV cache memory pool, from which the K and V vectors are read.

When the kernel launches, thousands of threads work in parallel. Each thread (or warp of threads) performs the address lookup using the block table on the fly, fetches the required K and V vectors from the calculated physical address, and immediately uses them in its computation. This happens simultaneously for all queries, all keys, and all attention heads in the batch.
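
Here is a simplified, single-query version of that cycle in plain PyTorch. The real kernel does this per thread in CUDA and fuses it with the softmax, so treat this only as a sketch of the data flow:

    import torch

    def paged_attention_single_query(q, k_pool, v_pool, block_table, seq_len, block_size=16):
        """Attention for one query vector over a paged KV cache.

        q:               [head_dim]
        k_pool, v_pool:  [num_physical_blocks, block_size, head_dim]
        block_table:     [num_logical_blocks] physical block IDs for this sequence
        seq_len:         number of cached tokens for this sequence
        """
        scores, values = [], []
        for pos in range(seq_len):
            blk = block_table[pos // block_size]         # lookup in the block table
            off = pos % block_size
            k = k_pool[blk, off]                         # load K from its physical block
            v = v_pool[blk, off]                         # load V from the same slot
            scores.append((q @ k) / q.shape[-1] ** 0.5)  # compute the scaled dot product
            values.append(v)
        probs = torch.softmax(torch.stack(scores), dim=0)
        return probs @ torch.stack(values)               # weighted sum of the values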


Q3: What happens if the data isn't even in VRAM?

Anshuman's post mentioned swapping. What if I need K,V vectors for a 200k-token sequence, but my GPU can only hold 64k tokens in its VRAM?

The In-Depth Answer: This triggers KV cache swapping: the blocks that do not fit in VRAM reside in CPU RAM.

  • Detection: The block table for the sequence contains entries that don't point to a VRAM address. Instead, they hold a special flag indicating that the block is "offloaded" to the CPU.
  • How the Swap-In Kicks in: When the scheduler decides to run this sequence, it must first bring the required blocks to the GPU. It allocates free physical blocks in VRAM and initiates a memory copy from CPU RAM over the PCIe bus. This is a major bottleneck: PCIe Gen5 x16 offers roughly 64 GB/s per direction (about 128 GB/s bidirectional), whereas a high-end GPU's HBM3 VRAM bandwidth can exceed 3,000 GB/s. A back-of-the-envelope cost estimate follows this list.
  • How does Eviction happen: If VRAM is full, the scheduler must first evict other blocks to make space. It typically uses a Least Recently Used (LRU) policy, swapping out the blocks of sequences that haven't been processed recently. This creates a bidirectional flow of data across the PCIe bus.
  • Table Update: After a successful swap-in, the block table is updated with the new physical VRAM addresses, and the sequence becomes "runnable."
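
As promised above, here is a back-of-the-envelope estimate of what such a swap-in costs. The numbers are illustrative (a 7B-class model with an FP16 KV cache and rough bandwidth figures), but they show why PCIe is the bottleneck:

    # Rough swap-in cost for the 200k-token example (all numbers are illustrative).
    num_layers, num_kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2  # 7B-class model, FP16

    # KV cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes per element
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # 512 KiB

    tokens_to_swap = 200_000 - 64_000                 # tokens currently offloaded to CPU RAM
    swap_bytes = tokens_to_swap * bytes_per_token     # ~71 GB

    pcie_bw = 64e9    # ~PCIe Gen5 x16, one direction, bytes/s
    hbm_bw = 3000e9   # ~HBM3, bytes/s

    print(f"swap size: {swap_bytes / 1e9:.1f} GB")              # ~71.3 GB
    print(f"time over PCIe: {swap_bytes / pcie_bw:.2f} s")      # ~1.1 s
    print(f"same bytes from HBM: {swap_bytes / hbm_bw:.3f} s")  # ~0.024 s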

Q4: Does the kernel process what it has and then wait for the rest?

If some blocks are in VRAM but others need to be swapped in, does the kernel start partial work?

The In-Depth Answer: No. The attention mechanism is an all-or-nothing computation. This is a strict mathematical dependency, not an implementation choice: every key and value must be present. The core of the attention formula includes a softmax function, which normalizes the attention scores into a probability distribution. The denominator of the softmax is the sum of exponentiated scores over all keys (Σ exp(q·k_i)). If you were to compute this with only a subset of the keys, the resulting probability distribution would be incorrect and the final output vector would be mathematically invalid. Therefore, the kernel cannot run until a staging phase is complete, during which the scheduler ensures every single required KV block for that sequence is physically present in VRAM.
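
A tiny numeric demonstration of the point, with made-up attention scores:

    import torch

    scores = torch.tensor([2.0, 1.0, 0.5, 3.0])  # q·k_i for four cached tokens

    full = torch.softmax(scores, dim=0)           # correct distribution over all keys
    partial = torch.softmax(scores[:2], dim=0)    # pretend only the first two keys were in VRAM

    print(full[:2])   # ~[0.23, 0.09]: the true weights of keys 0 and 1
    print(partial)    # ~[0.73, 0.27]: same keys, wrong weights, wrong normalization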


Q5: What if multiple requests are all waiting for swaps?

This sounds like my nightmare scenario - being stuck in a traffic jam in Bangalore. If several long sequences all need to swap data from the CPU, what does the system do?

The In-Depth Answer: Here's where vLLM's scheduler comes into play. It moves beyond a simple queue and acts as a resource manager that prevents system gridlock. In each scheduling cycle, it evaluates the state of all running, waiting, and swapped requests and makes a heuristic-based decision. Its policies consider multiple factors:

  • Swap Cost: It prioritizes requests that require the least amount of data to be transferred over the PCIe bus, as this unblocks a request and feeds the GPU sooner (a toy priority function along these lines is sketched after this list).
  • Policy around Eviction: It chooses which blocks to swap out (evict) to make space, typically with an LRU-style heuristic, aiming to minimize the chance of needing an evicted block again soon.
  • Preventing Starvation: The scheduler has to track how long requests have been waiting to ensure that large, costly requests eventually get processed and are not perpetually overtaken by smaller, cheaper ones.
  • Overlapping I/O and Compute: Hiding the latency of PCIe transfers is super important. While one request is being swapped in (an I/O-bound operation), the scheduler will instruct the GPU to work on any other requests that are already fully available in VRAM.
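
The toy priority function mentioned above, capturing the swap-cost and anti-starvation trade-off. This is purely my own sketch of the idea, not vLLM's actual scheduler code, and the constants are arbitrary:

    from dataclasses import dataclass

    @dataclass
    class Request:
        blocks_to_swap_in: int  # KV blocks that must cross PCIe before this request can run
        wait_time_s: float      # how long the request has been waiting

    def priority(req: Request,
                 block_bytes: int = 16 * 512 * 1024,  # 16 tokens/block * ~512 KiB of KV per token
                 pcie_bw: float = 64e9,               # bytes/s, one PCIe Gen5 x16 direction
                 starvation_weight: float = 2.0) -> float:
        """Cheaper swaps raise priority; long waits raise it too, so big requests aren't starved."""
        swap_cost_s = req.blocks_to_swap_in * block_bytes / pcie_bw
        return -swap_cost_s + starvation_weight * req.wait_time_s

    # The cheap request normally wins, but a request that has waited long enough overtakes it.
    requests = [Request(blocks_to_swap_in=4000, wait_time_s=12.0),
                Request(blocks_to_swap_in=100, wait_time_s=0.5)]
    next_up = max(requests, key=priority)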

Bonus Q: How do Tensor Cores fit into the scheduler's logic?

Modern GPUs have specialized Tensor Cores. Does the vLLM scheduler manage them directly?

The In-Depth Answer: The vLLM scheduler's influence is indirect. It does not manage hardware-level instruction dispatch. However, it knows when Tensor Cores will be used!

  • Hardware Reality: Tensor Cores are specialized units within each SM that execute matrix-multiply-accumulate operations (D = A * B + C) with extreme efficiency, especially on lower-precision formats like FP16, BF16, or INT8.
  • Kernel Implementation: Optimized kernels like FlashAttention are written in low-level languages such as CUDA or Triton, or built on libraries like CUTLASS, and these generate code that runs on Tensor Cores.
  • Scheduler's Role: The vLLM scheduler understands this aspect through performance estimation. It knows that a model using FP16 precision will have a much shorter kernel execution time than one using FP32, because the underlying kernel will automatically leverage Tensor Cores. This faster estimated runtime is a crucial variable in its prioritization and batching decisions! A quick FP16-vs-FP32 timing check is sketched after this list.
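
The timing check mentioned above: a quick way to see the precision effect on a CUDA GPU with Tensor Cores. The exact speedup depends on the hardware, and newer GPUs may route FP32 matmuls through TF32 Tensor Cores as well, so treat this as a rough illustration:

    import torch

    def time_matmul(dtype, n=4096, iters=20):
        """Average time of an n x n matmul at the given precision, in milliseconds."""
        a = torch.randn(n, n, device="cuda", dtype=dtype)
        b = torch.randn(n, n, device="cuda", dtype=dtype)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            a @ b
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters

    print("fp32:", time_matmul(torch.float32), "ms")
    print("fp16:", time_matmul(torch.float16), "ms")  # typically several times faster via Tensor Cores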

Another Bonus Q: Isn't constantly fetching from VRAM inefficient?

This "just-in-time" fetching from global VRAM raised a red flag. Shared memory is obviously much faster. Are we really going back to VRAM (global memory, still on GPU) again and again?

The In-Depth Answer: Yes, naively going back to global memory for every single value would be a huge bottleneck. The efficiency comes from tiling: blocks of data are loaded from VRAM into fast on-chip shared memory once and reused across many computations before the next block is fetched. The goal is to increase arithmetic intensity - the ratio of compute operations to memory operations. This matters because attention is a fundamentally IO-bound operation; FlashAttention has explored this area in depth to make the operation as fast as possible.
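
A back-of-the-envelope arithmetic-intensity comparison (illustrative numbers) that shows why decode-time attention is memory-bound while a large matmul is compute-bound:

    # Arithmetic intensity = FLOPs performed per byte moved between VRAM and the SMs.

    def intensity(flops, bytes_moved):
        return flops / bytes_moved

    # Decode-time attention: one query against 4096 cached tokens, head_dim=128, FP16.
    # Every K/V element is read once and used in about two FLOPs, so intensity stays near 1.
    seq_len, head_dim, dtype_bytes = 4096, 128, 2
    attn_flops = 2 * 2 * seq_len * head_dim            # q·k scores plus the weighted sum of V
    attn_bytes = 2 * seq_len * head_dim * dtype_bytes  # read every K and V once
    print("decode attention:", intensity(attn_flops, attn_bytes))  # ~1 FLOP/byte -> memory-bound

    # A large matmul reuses each loaded tile many times.
    n = 4096
    mm_flops = 2 * n ** 3
    mm_bytes = 3 * n * n * dtype_bytes                 # read A and B, write C once (ideal reuse)
    print("4096^3 matmul:", intensity(mm_flops, mm_bytes))         # ~1365 FLOPs/byte -> compute-bound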

There are a lot of concepts here that I don't yet understand, and I will devote a future blog post to them.