Which tokens does a hybrid model predict better?
Researchers at Ai2 compare token-level prediction differences between their 7B transformer (OLMo 3) and hybrid model (OLMo Hybrid), which combines attention and recurrent layers. The study finds hybrid models outperform transformers on meaning-bearing tokens like nouns, verbs, and adjectives, and on tokens requiring contextual tracking such as pronoun resolution. However, the hybrid's advantage nearly vanishes on verbatim repeated text, where attention's ability to directly look up earlier tokens gives transformers the edge. The work also proposes using filtered token losses — scoring only specific token categories — as a more fine-grained evaluation metric to surface architectural differences during pretraining that aggregate loss metrics would miss.
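To make the filtered-loss idea concrete, here is a minimal sketch of how per-category losses might be computed from per-token log-probabilities. It is an illustration under stated assumptions, not the study's actual evaluation code: the function names `filtered_token_losses` and `loss_gap`, the category labels (`"NOUN"`, `"FUNC"`), and the toy numbers are all hypothetical.

```python
from collections import defaultdict


def filtered_token_losses(token_logprobs, categories):
    """Return the mean negative log-likelihood for each token category.

    token_logprobs: log-probability the model assigned to each true token.
    categories: one label per token (hypothetical labels, e.g. POS tags).
    """
    if len(token_logprobs) != len(categories):
        raise ValueError("need exactly one category per token")
    totals = defaultdict(float)
    counts = defaultdict(int)
    for logprob, category in zip(token_logprobs, categories):
        totals[category] += -logprob  # loss is the negative log-probability
        counts[category] += 1
    return {c: totals[c] / counts[c] for c in totals}


def loss_gap(logprobs_a, logprobs_b, categories):
    """Per-category loss of model A minus model B.

    A positive value means model B predicts that category better.
    """
    losses_a = filtered_token_losses(logprobs_a, categories)
    losses_b = filtered_token_losses(logprobs_b, categories)
    return {c: losses_a[c] - losses_b[c] for c in losses_a}


# Toy data, invented for illustration only (not results from the study).
categories = ["NOUN", "FUNC", "NOUN"]
transformer_logprobs = [-2.0, -0.5, -3.0]
hybrid_logprobs = [-1.0, -0.5, -2.0]

gaps = loss_gap(transformer_logprobs, hybrid_logprobs, categories)
```

In this toy setup, averaging over all three tokens would blend the noun gap with the zero gap on the function word, which is the kind of dilution that a per-category breakdown is meant to avoid.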
