Direct Preference Optimization Beyond Chatbots

https://huggingface.co/blog/Dharma-AI/direct-preference-optimization-beyond-chatbots

Direct Preference Optimization (DPO) can be used to reduce specific model failure modes such as text degeneration, where a model gets stuck in a repetition loop. Supervised Fine-Tuning (SFT) optimizes a token-by-token objective, so it has no direct way to penalize a failure that only shows up across a whole completion. DPO instead compares entire outputs and can explicitly push probability away from a bad completion. In an experiment with an OCR model, degenerate outputs served as the "rejected" completions and correct transcriptions as the "chosen" completions for DPO training. Applied as a second training stage after SFT, this reduced text degeneration by an average of 59.4% across five different model families, showing that DPO is useful beyond typical chatbot alignment tasks.
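To make the "chosen" versus "rejected" setup concrete, here is a minimal sketch in plain Python. It is not the blog post's code. `looks_degenerate` is a hypothetical repeated n-gram heuristic for flagging repetition loops, and its thresholds are assumptions. The `prompt`/`chosen`/`rejected` record layout is a common convention for preference data, not something the summary specifies. `dpo_loss` is the standard DPO loss for one pair, computed from summed completion-level log-probabilities under the policy and a frozen reference model; `beta=0.1` is an illustrative value.

```python
import math
from collections import Counter


def looks_degenerate(text, n=3, max_repeats=3):
    """Hypothetical heuristic: flag text whose most frequent word n-gram
    occurs more than max_repeats times (a crude repetition-loop signal)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return False
    return Counter(ngrams).most_common(1)[0][1] > max_repeats


def make_pair(prompt, reference_text, model_output):
    """Build one preference record, or None if the model output is not
    degenerate and therefore gives no useful 'rejected' example."""
    if not looks_degenerate(model_output):
        return None
    return {"prompt": prompt, "chosen": reference_text, "rejected": model_output}


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for a single pair.

    Each argument is the log-probability of a whole completion, i.e. the
    sum of its per-token log-probabilities under the policy or reference.
    """
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    pair = make_pair(
        prompt="<image of an invoice>",  # placeholder, not real data
        reference_text="Invoice 42 total due 19.99",
        model_output="total due total due total due total due total due",
    )
    print(pair)
    # Illustrative log-probabilities only; real values come from the models.
    print(dpo_loss(-12.0, -30.0, -14.0, -20.0))
```

The loss depends on the log-probability of each completion as a whole, which is why a repetition loop can be penalized as a unit. When the policy and reference agree on both completions the margin is zero and the loss equals log 2; it falls as the policy favors the chosen transcription over the degenerate one more strongly than the reference does.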

0 points • by chrisf • 1 month ago
