
Towards Data Science, 10 min
Water Cooler Small Talk, Ep. 11: Overfitting in RAG evaluation
When a RAG evaluation set is repeatedly used to identify failures and tune the system, it quietly becomes a training set — a form of overfitting. The post explains how this happens through prompt tuning on the same test questions, cherry-picking easy examples, and writing questions derived from already-indexed documents. The fix mirrors classical ML discipline: maintain a genuinely held-out test set, build questions independently of system behavior, and treat suspiciously high scores with skepticism. The broader pattern is framed through Goodhart's Law — when a measure becomes a target, it stops being a good measure.
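One way to picture the held-out discipline described above is a small sketch that freezes a test split before any tuning starts. Everything here is an illustrative assumption rather than the post's own code: the function names (`assign_split`, `split_eval_set`, `generalization_gap`), the `id` field on each question, and the 30% test fraction are all hypothetical. The split is keyed on a hash of the question ID, so a question never migrates between splits as the set grows or is reshuffled.

```python
import hashlib


def assign_split(question_id: str, test_fraction: float = 0.3) -> str:
    """Deterministically map a question ID to 'dev' or 'test' (hypothetical helper)."""
    digest = hashlib.sha256(question_id.encode("utf-8")).hexdigest()
    # The first 8 hex characters give an integer in [0, 2**32); scale to [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "dev"


def split_eval_set(questions, test_fraction: float = 0.3):
    """Partition questions into a dev set (for tuning) and a held-out test set."""
    dev, test = [], []
    for q in questions:
        target = test if assign_split(q["id"], test_fraction) == "test" else dev
        target.append(q)
    return dev, test


def generalization_gap(dev_score: float, test_score: float) -> float:
    """Dev score minus held-out score; a large positive gap hints at overfitting."""
    return dev_score - test_score


if __name__ == "__main__":
    # Made-up questions, only to show the call pattern.
    questions = [{"id": f"q{i}", "text": f"example question {i}"} for i in range(20)]
    dev, test = split_eval_set(questions)
    print(len(dev), "dev questions;", len(test), "held-out questions")
```

Under this sketch, prompts and retrieval settings would be tuned only against `dev`, and `test` would be scored rarely, for example once per release candidate. If the dev score keeps climbing while the gap widens, that is the quiet drift from evaluation set to training set that the post warns about.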