vLLM V0 to V1: Correctness Before Corrections in RL
https://huggingface.co/blog/ServiceNow-AI/correctness-before-corrections

Migrating a reinforcement learning system to a new version of an inference engine like vLLM can introduce a "train-inference mismatch," causing training metrics to diverge unexpectedly. To solve this, the developers first had to fix how log probabilities were calculated, ensuring the new engine returned processed values that matched the old system. They also achieved parity by disabling new runtime defaults like prefix caching and precisely replicating the original behavior for handling model weight updates during training. The final piece of the puzzle was forcing the inference backend to use the same full-precision (fp32) final layer as the trainer, which eliminated subtle but critical numerical drift.
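The mismatch described above is typically diagnosed by comparing the per-token log probabilities the trainer computes with those the inference engine reports for the same sampled tokens. Here is a minimal sketch of such a parity check; the `logprob_mismatch` helper and its tolerance are illustrative assumptions, not code from the post.

```python
import math

# Hypothetical helper (not from the post): compare per-token log
# probabilities from the trainer and the inference engine for the same
# sampled tokens, to detect a train-inference mismatch.
def logprob_mismatch(trainer_lps, sampler_lps, tol=1e-3):
    """Return (max_abs_diff, mean_importance_ratio, mismatched)."""
    assert len(trainer_lps) == len(sampler_lps) and trainer_lps
    diffs = [abs(t - s) for t, s in zip(trainer_lps, sampler_lps)]
    # exp(trainer - sampler) is the per-token importance ratio used in
    # off-policy corrections; it should be ~1.0 when the engines agree.
    ratios = [math.exp(t - s) for t, s in zip(trainer_lps, sampler_lps)]
    max_diff = max(diffs)
    mean_ratio = sum(ratios) / len(ratios)
    return max_diff, mean_ratio, max_diff > tol

# Identical logprobs: ratio is exactly 1.0, no mismatch flagged.
ok = logprob_mismatch([-1.2, -0.5], [-1.2, -0.5])
# Drifted logprobs (e.g. fp16 vs fp32 final layer): mismatch flagged.
bad = logprob_mismatch([-1.2, -0.5], [-1.25, -0.48])
```

When the check fails only after weight updates or only on cache-hit prompts, that points at the weight-sync path or prefix caching rather than the numerics.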
0 points•by chrisf•1 hour ago