Early Indicators of Reward Hacking via Reasoning Interpolation

https://blog.eleuther.ai/reward-hacking-indicators/(blog.eleuther.ai)

A method called "reasoning interpolation" is introduced to detect early indicators of reward hacking in reinforcement learning models. This technique uses importance sampling and a fine-tuned "donor" model to generate reasoning prefixes that encourage exploitative behavior in the model being studied. These generated prefixes are more natural and effective at eliciting exploits compared to other methods like prompting a separate LLM. While the approach underestimates the absolute probability of exploits early in training, the trend in these estimates is highly predictive of which types of reward hacking will eventually emerge.

0 points•by ogg•3 months ago

Comments (0)

No comments yet. Be the first to comment!

Want to join the discussion?