March 18, 2026
Henry (Lifan) Wang

From Single-Turn RLHF to Temporal Preference Learning

Single-turn RLHF trains models to win the next turn. Consumer AI needs to train them to earn the next session. We kept running into the same paradox in production: models that score higher on offline single-turn quality don't reliably win on long-horizon A/B metrics—retention, re-engagement, conversation depth.

January 30, 2026
Henry (Lifan) Wang

Offline Evaluation Is Dead: How Consumer AI Learns from Millions of Real User Decisions

Once consumer AI products reach scale, offline evaluation isn't just "not good enough"—it's irrelevant to production decisions. This conclusion comes from running systems at tens of millions of users, processing billions of conversations and behavioral signals every month. We've seen this pattern repeatedly in real user environments: models that top public creative writing benchmarks often underperform in actual usage.
