Researchers led by Suketu Patel tested several frontier transformer language models on the classic Stroop task and reported a sharp collapse in accuracy as the stimulus lists grew longer.
According to the paper, published in PNAS Nexus, models including GPT-4o, GPT-5, Claude 3.5 Sonnet, Claude Opus 4.1, and Gemini 2.5 scored well on short, five-item incongruent lists, but their accuracy fell dramatically on longer lists.
In one example trajectory reported in the paper, GPT-4o dropped from 91% accuracy at five words to 57% at ten words and 15% at 40 words. Lists that mixed matching and mismatched items produced near-0% accuracy on the conflicting items.
News outlets (ScienceDaily, TechXplore, NeuroscienceNews, Heise) repeated these findings and framed them as evidence of a structural divergence between human executive control and transformer attention.