From Single-Turn RLHF to Temporal Preference Learning
Single-turn RLHF trains models to win the next turn. Consumer AI needs to train them to earn the next session. We kept running into the same paradox in production: models that score higher on offline single-turn quality don't reliably win on long-horizon A/B metrics such as retention, re-engagement, and conversation depth.
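To make the mismatch concrete, here is a minimal sketch of why the two objectives can rank the same conversations differently. Everything in it is an illustrative assumption, not our production objective: `single_turn_objective`, `temporal_objective`, the discount `gamma`, and the next-session weight `w` are all hypothetical names, and the binary "did the user return" signal stands in for the long-horizon A/B metrics above.

```python
# Minimal sketch: why a model that wins each turn can lose the session.
# All names (single_turn_objective, temporal_objective, gamma, w) are
# illustrative assumptions, not a production training objective.

def single_turn_objective(turn_rewards: list[float]) -> float:
    # Single-turn RLHF view: average per-turn reward-model score;
    # the model is optimized to win the current turn in isolation.
    return sum(turn_rewards) / len(turn_rewards)

def temporal_objective(turn_rewards: list[float],
                       returned_next_session: bool,
                       gamma: float = 0.9,
                       w: float = 2.0) -> float:
    # Temporal view: discount within-session rewards over turns, then add
    # a delayed signal for whether the user came back the next session.
    discounted = sum(gamma ** t * r for t, r in enumerate(turn_rewards))
    return discounted + w * float(returned_next_session)

# Two hypothetical conversations: A wins every turn offline, but the user
# does not return; B scores lower per turn, yet earns the next session.
a = (single_turn_objective([0.9, 0.9, 0.9]),
     temporal_objective([0.9, 0.9, 0.9], returned_next_session=False))
b = (single_turn_objective([0.7, 0.7, 0.7]),
     temporal_objective([0.7, 0.7, 0.7], returned_next_session=True))
print(a)  # (0.9, 2.439): higher single-turn score, lower temporal score
print(b)  # (0.7, 3.897): lower single-turn score, higher temporal score
```

Under these assumed numbers, conversation A beats B on the offline single-turn metric while losing on the temporal one, which is exactly the offline/online gap described above.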