Stride and prejudice: How a 32-bit overflow corrupted a CUDA kernel (and stayed hidden for weeks)
https://www.ai21.com/blog/vllm-cuda-integer-overflow/ (www.ai21.com)
A mysterious log-probability mismatch appeared during GRPO training of the Jamba 3B model and proved difficult to debug. The issue initially looked like a training instability, but it was isolated to the vLLM rollout path once the error spikes were found to correlate directly with the number of rollouts. The root cause was ultimately identified as a silent 32-bit unsigned integer overflow deep within a CUDA kernel. The bug triggered only when the number of cache slots exceeded a specific high threshold, which explains why it stayed hidden for so long and was so hard to reproduce.
0 points • by chrisf • 2 hours ago
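
The summary above does not show the kernel's actual indexing expression, so the following is only a minimal sketch of the general failure mode it describes: an offset computed as slot index times stride in 32-bit unsigned arithmetic wraps around once the product reaches 2**32. The function names and the stride of 4096 elements per slot are hypothetical, chosen purely for illustration, and Python is used to emulate the wraparound rather than to reproduce the CUDA code.

```python
U32_MASK = 0xFFFFFFFF  # largest value a 32-bit unsigned integer can hold


def offset_u32(slot: int, stride: int) -> int:
    """Emulate slot * stride in 32-bit unsigned arithmetic (wraps modulo 2**32)."""
    return (slot * stride) & U32_MASK


def offset_u64(slot: int, stride: int) -> int:
    """The intended offset, computed without a 32-bit limit."""
    return slot * stride


def first_overflowing_slot(stride: int) -> int:
    """Smallest slot index whose 32-bit offset no longer matches the true offset."""
    return U32_MASK // stride + 1


if __name__ == "__main__":
    stride = 4096  # hypothetical elements per cache slot, not taken from the article
    threshold = first_overflowing_slot(stride)
    for slot in (threshold - 1, threshold):
        print(slot, offset_u32(slot, stride), offset_u64(slot, stride))
```

Below the threshold both functions agree, so any test with a small cache passes. At and above it the 32-bit offset silently points back into memory that belongs to earlier slots, which is consistent with a bug that only shows up at large slot counts. In a kernel, the usual remedy is to widen the index to 64 bits before multiplying.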