January 30, 2026
Henry (Lifan) Wang

Offline Evaluation Is Dead: How Consumer AI Learns from Millions of Real User Decisions

Once a consumer AI product reaches scale, offline evaluation isn't just "not good enough": it stops being relevant to production decisions. This conclusion comes from running production systems that serve tens of millions of users and process billions of conversations and behavioral signals every month. In real user environments we have repeatedly seen the same pattern: models that rank near the top of public creative writing benchmarks often deliver only mediocre results in actual usage.
