Science / Wed, 10 Jun 2026 Let's Data Science

Study Finds Transformer Attention Fails Stroop Test

Researchers led by Suketu Patel tested several frontier transformer language models on the classic Stroop task and report that accuracy collapsed sharply as the stimulus lists grew longer. According to the paper, published in PNAS Nexus, models including GPT-4o, GPT-5, Claude 3.5 Sonnet, Claude Opus 4.1, and Gemini 2.5 scored well on short, five-item incongruent lists but fell dramatically on longer ones. In one example trajectory from the paper, GPT-4o dropped from 91% accuracy at five words to 57% at ten words and 15% at 40 words. On lists that mixed matching and mismatched items, accuracy on the conflicting items was near 0%, the paper reports. News outlets (ScienceDaily, TechXplore, NeuroscienceNews, Heise) relayed these findings and framed them as evidence of a structural divergence between human executive control and transformer attention.

© All Rights Reserved.