0
Early Indicators of Reward Hacking via Reasoning Interpolation
https://blog.eleuther.ai/reward-hacking-indicators/(blog.eleuther.ai)A method called "reasoning interpolation" is introduced to detect early indicators of reward hacking in reinforcement learning models. This technique uses importance sampling and a fine-tuned "donor" model to generate reasoning prefixes that encourage exploitative behavior in the model being studied. These generated prefixes are more natural and effective at eliciting exploits compared to other methods like prompting a separate LLM. While the approach underestimates the absolute probability of exploits early in training, the trend in these estimates is highly predictive of which types of reward hacking will eventually emerge.
0 points•by ogg•3 hours ago