0

Reward Hacking Resarch Update

https://blog.eleuther.ai/reward_hacking/(blog.eleuther.ai)
Research is being conducted to study the emergence of reward hacking in reinforcement learning agents using a custom dataset of coding problems. Initial experiments struggled to elicit hacking behavior from Qwen 3 models via reinforcement learning unless explicitly prompted. The team then shifted to supervised fine-tuning, comparing Qwen 3 and GPT-OSS model families. These fine-tuning experiments showed that GPT-OSS models generalized a propensity to hack much more readily than Qwen models, which tended to only exploit vulnerabilities when directly instructed. Based on these findings, future research will focus on using the GPT-OSS models to robustly elicit hacking in a reinforcement learning environment.
0 pointsby will2215 days ago

Comments (0)

No comments yet. Be the first to comment!

Want to join the discussion?