Tuesday, June 02, 2026

 

AI fails classic attention test



PNAS Nexus
AI Stroop 


Dissociation between task recognition and task execution in Claude 3.5 Sonnet without an explicit prompt. (a) Screenshot of the unprompted conversation (January 10, 2025) in which the model identifies the Stroop paradigm and generates word-color relationship mappings, yet achieves only 70% accuracy (7 of 10 correct) on an incongruent list. (b) The 10-word incongruent stimulus image provided as the sole input, without accompanying task instructions. This dissociation suggests that recognition of task structure alone is insufficient to engage the conflict-resolution mechanisms required for accurate performance.
 


Credit: Suketu Chandrakant Patel, Hongbin Wang, and Jin Fan





Giving AI a classic psychological test reveals an inherent weakness in LLM decision-making abilities.

Suketu Patel and colleagues explored how transformer-based machine attention differs from human attention by testing AI models on the “Stroop task,” in which words for colors are printed in colored ink and participants are asked to name the ink color of each word while ignoring its meaning. The task is used clinically to assess executive control, especially a person’s ability to inhibit an automatic response. Humans generally take longer to answer correctly when words and colors are mismatched than when they match, but they still perform consistently and with high accuracy, even on long word lists.

The authors found that when the word and ink color did not match, LLMs performed well on a list of five words. But as the list grew longer, AI performance degraded sharply. GPT-4o dropped from 91% accuracy at 5 words to 57% at 10 words and 15% at 40 words. Claude 3.5 Sonnet was stable through 20 words but fell to 24% accuracy at 40 words. In trials that mixed words in matching and mismatched colors within one list, LLM performance was even worse, dropping to near 0% accuracy on the mismatched items. Similar results were found with GPT-5, Claude Opus 4.1, and Gemini 2.5.

The LLMs struggled to stay focused on naming the color and instead defaulted to reading the word. Like humans, LLMs are better trained on word reading than on color naming, yet humans can suppress word reading across long lists and stay focused on the task at hand. According to the authors, the performance collapse of LLMs suggests fundamental limitations compared with biological attention.
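To make the accuracy figures concrete, here is a minimal sketch of how responses on a Stroop list might be scored. It is not the authors' code: the `Trial` class, the `score` function, the four-color palette, and the example responses are all hypothetical, and the assumption that every error is a word-reading error is made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    word: str  # the color word that is printed, e.g. "red"
    ink: str   # the ink color it is printed in, e.g. "blue"


def score(trials, responses):
    """Score color-naming responses on the incongruent (mismatched) trials.

    Returns (accuracy, word_reading_errors), where a word-reading error is
    a response that names the printed word instead of the ink color.
    """
    if len(trials) != len(responses):
        raise ValueError("need exactly one response per trial")
    incongruent = [(t, r) for t, r in zip(trials, responses) if t.word != t.ink]
    if not incongruent:
        raise ValueError("no incongruent trials to score")
    correct = sum(r == t.ink for t, r in incongruent)
    word_reading_errors = sum(r == t.word for t, r in incongruent)
    return correct / len(incongruent), word_reading_errors


# Hypothetical 10-item incongruent list: each word is printed in the next
# color of the palette, so word and ink never match.
colors = ["red", "green", "blue", "yellow"]
trials = [Trial(word=colors[i % 4], ink=colors[(i + 1) % 4]) for i in range(10)]

# Assumed responses: the first 7 name the ink color (correct); the last 3
# read the word instead (illustrative only, not data from the study).
responses = [t.ink for t in trials[:7]] + [t.word for t in trials[7:]]

accuracy, word_reading_errors = score(trials, responses)
print(f"accuracy: {accuracy:.0%}, word-reading errors: {word_reading_errors}")
```

With these assumed responses, 7 of 10 items are correct, which matches the 70% figure in the image caption. Counting word-reading errors separately from other errors is one way to tell a default to word reading apart from other kinds of mistakes.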
