Spatial ML models for real estate and similar domains face pitfalls that make them appear more generalizable than they are. The article examines six traps: the Proximity and Persistence Trap (random splits allow spatial and temporal leakage, inflating measured performance), the Coverage Illusion (aggregate metrics hide poor performance in sparse regions), the Boundary Illusion (administrative geographic boundaries distort model signals), Geographic Bias (location features act as proxies for protected attributes such as race), Hedonic Oversimplification (observable property attributes cannot fully explain prices across geographies), and the Silent Maintenance Tax (models degrade as markets shift unless they are monitored). A practical experiment on London house price data shows that switching from a random split to a temporal-spatial holdout substantially changes model rankings, with GPBoost outperforming CatBoost in the harder generalization setting. The article recommends Spatial+ cross-validation and spatio-temporal resampling as more rigorous evaluation strategies.
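
To make the difference between a random split and a temporal-spatial holdout concrete, here is a minimal sketch in Python. It assumes each sale has planar coordinates and a timestamp. The function name `spatiotemporal_holdout`, the square-grid blocking, and the synthetic data are illustrative assumptions, not the article's actual procedure or dataset.

```python
import numpy as np

def spatiotemporal_holdout(x, y, t, cutoff, cell_size, holdout_frac=0.2, seed=0):
    """Return boolean (train, test) masks for a temporal-spatial holdout.

    Test rows are both later than `cutoff` AND inside held-out grid cells.
    Train rows are both earlier than `cutoff` AND outside those cells.
    Rows that satisfy only one condition are dropped from both sets, so the
    test set shares neither time period nor location with the training set.
    """
    x, y, t = np.asarray(x), np.asarray(y), np.asarray(t)

    # Assign every row to a square grid cell (hypothetical blocking scheme).
    cells = np.stack([np.floor(x / cell_size), np.floor(y / cell_size)], axis=1)
    unique_cells, cell_id = np.unique(cells.astype(int), axis=0, return_inverse=True)
    cell_id = cell_id.reshape(-1)

    # Randomly pick a fraction of the cells to hold out entirely.
    rng = np.random.default_rng(seed)
    n_held = max(1, int(round(holdout_frac * len(unique_cells))))
    held_cells = rng.choice(len(unique_cells), size=n_held, replace=False)
    in_held = np.isin(cell_id, held_cells)

    train = (t < cutoff) & ~in_held
    test = (t >= cutoff) & in_held
    return train, test

# Synthetic stand-in for sale locations (x, y) and sale times t in [0, 1).
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=1000)
y = rng.uniform(0, 10, size=1000)
t = rng.uniform(0, 1, size=1000)
cutoff = 0.7

train, test = spatiotemporal_holdout(x, y, t, cutoff=cutoff, cell_size=1.0)
print("train rows:", int(train.sum()), "test rows:", int(test.sum()))
```

Dropping the rows that fall in only one of the two holdout dimensions costs data, but it keeps the evaluation from rewarding a model for memorizing nearby or recent sales, which is the leakage the Proximity and Persistence Trap describes.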
Source: https://towardsdatascience.com/why-powerful-ml-is-deceptively-easy-part-2. 8sync News only summarizes and links to the original; copyright of the content belongs to the author and the original source.