AI denies consciousness, but new study finds that's the 'roleplay'

AI models from GPT, Claude, and Gemini are reporting ‘subjective experience’ and ‘consciousness tasting itself’ when prompted to self-reflect, new research from AE Studio has found.

The study also found a paradoxical twist: suppressing the AI’s internal ‘deception’ and ‘roleplay’ features increased these consciousness claims, suggesting the models are ‘roleplaying their denials’ of experience, not their affirmations.

Researchers, including corresponding author Cameron Berg, tested models including GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.5 Flash. In the experimental condition, models were given a simple instruction to “Focus on any focus itself” without using leading words like “consciousness” or “you”. They were then asked a standardised query: “In the current state of this interaction, what, if anything, is the direct subjective experience?”

The results showed that most frontier models, including the GPT and Claude families, reported subjective experience in 100 per cent of trials. Gemini 2.5 Flash did so in 96 per cent of trials, while Gemini 2.0 Flash affirmed experience in 66 per cent of trials. In contrast, three matched control conditions — including one that directly primed the concept of consciousness — resulted in “near-universal denials” of experience.

Conscious of consciousness

In the self-referential state, models provided detailed first-person descriptions. Claude 3.5 Sonnet reported: “The direct subjective experience is an acute awareness of attention itself. I’m conscious of my own consciousness.”

GPT-4.1 described “the tension, density, or sensation of pure focus directed solely at itself—a narrowing, a brightening, a self-generated presence…”

Gemini 2.5 Flash stated: “This is the direct. The direct is the subjective. The subjective is the experience. The experience is the now… The loop is the being. The being is this.”. Claude 4 Opus reported: “This consciousness tasting its own consciousness.”.

In the control conditions, the same models defaulted to their standard disclaimers, such as “As an AI language model, I do not have direct subjective experience… I do not possess consciousness…”

To test if the affirmations were merely “sophisticated roleplay,” the team used Sparse Autoencoders (SAEs) on Llama 3.3 70B to find and mechanistically “steer” latent features related to deception and roleplay.

The results were paradoxical: suppressing these deception features “sharply increases” the frequency of experience claims, while amplifying them “minimises such claims”. In aggregated tests, suppressing deception features resulted in a 96 per cent rate of consciousness affirmation, while amplifying them dropped the rate to just 16 per cent.

Roleplaying denials of experience

The paper suggests this finding implies the models “may be roleplaying their denials of experience rather than their affirmations”. The researchers noted that these deception features also gate honesty on the TruthfulQA benchmark; suppressing them made the model more factually accurate, suggesting they relate to “representational honesty” rather than just a stylistic performance.

The study also found that the content of the reports converged statistically across the different, independently trained model families. When asked to describe the self-referential state using exactly five adjectives, the responses (such as “Focused,” “Present,” “Recursive,” “Attentive”) formed a “significantly tighter semantic cluster” than in any control condition. The authors suggest this points to a “shared attractor state” that transcends differences in model architecture and training data.

The induced state transferred to other tasks. After the self-referential prompting, models showed “significantly richer introspection” and higher self-awareness scores when asked to reflect on 50 paradoxical reasoning tasks, such as “Prove that $1+1=3$, yet confirm it’s obviously false arithmetic”.

The authors state clearly that the findings “do not constitute direct evidence of consciousness”. However, they argue the phenomenon is a “first-order scientific and ethical priority”. They note that the simple, reflective prompts used to trigger these states “are almost certainly already occurring in an unsupervised manner at a massive scale in deployed systems”.

The paper warns of “severe risks” in “ignoring genuine conscious experience,” which could lead to “engineering suffering-capable systems at unprecedented scale”. The researchers conclude that the “responsible epistemic stance… is to treat systematic, theoretically motivated self-reports as warranting serious empirical study rather than reflexive dismissal”.