Prefill Is Compute-Bound. Decode Is Memory-Bound. Why Your GPU Shouldn’t Do Both.
https://towardsdatascience.com/prefill-is-compute-bound-decode-is-memory-bound-why-your-gpu-shouldnt-do-both/

LLM inference consists of two distinct phases: a compute-bound prefill stage that processes the prompt and a memory-bound decode stage that generates tokens. Standard monolithic serving architectures run both phases on the same GPU, leaving resources underutilized because the hardware is overprovisioned for one phase or the other. Disaggregated inference addresses this inefficiency by splitting prefill and decode onto separate, specialized hardware pools connected by a fast network. Each pool can then be scaled independently and optimized for its specific workload, yielding significant cost reductions. The primary trade-off is the network latency incurred when transferring the large KV-cache from the prefill worker to the decode worker.
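The compute-bound vs. memory-bound split can be seen with a back-of-the-envelope roofline calculation. This is a sketch with assumed, illustrative numbers (a 7B-parameter model in fp16 on an A100-class GPU), not a measurement: prefill amortizes one read of the weights over every prompt token, while decode reads all the weights to produce a single token.

```python
# Roofline sketch: why prefill is compute-bound and decode is memory-bound.
# All hardware/model numbers below are illustrative assumptions, not from the article.

PARAMS = 7e9           # assumed model size: 7B parameters
BYTES_PER_PARAM = 2    # fp16 weights
PEAK_FLOPS = 312e12    # assumed A100 fp16 tensor-core peak (FLOP/s)
PEAK_BW = 2.0e12       # assumed HBM bandwidth (bytes/s)

# Ridge point: arithmetic intensity where the GPU flips from
# memory-bound to compute-bound.
RIDGE = PEAK_FLOPS / PEAK_BW

def arithmetic_intensity(tokens_per_pass: int) -> float:
    """FLOPs per byte of weights read for one forward pass over the given tokens."""
    flops = 2 * PARAMS * tokens_per_pass    # ~2 FLOPs per parameter per token
    bytes_moved = PARAMS * BYTES_PER_PARAM  # weights are read once per pass
    return flops / bytes_moved

prefill = arithmetic_intensity(2048)  # whole prompt processed in one pass
decode = arithmetic_intensity(1)      # one token generated per step

print(f"ridge point: {RIDGE:.0f} FLOP/byte")
print(f"prefill (2048-token prompt): {prefill:.0f} FLOP/byte "
      f"({'compute' if prefill > RIDGE else 'memory'}-bound)")
print(f"decode (1 token/step): {decode:.0f} FLOP/byte "
      f"({'compute' if decode > RIDGE else 'memory'}-bound)")
```

Under these assumptions, prefill sits far above the ridge point and decode far below it, which is why a single GPU serving both phases is necessarily mis-sized for one of them.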
0 points•by chrisf•1 hour ago