Text generated by large language models remains readily distinguishable from human writing even after extensive calibration, with automated classifiers detecting AI-generated social media posts with 70 to 80 per cent accuracy, according to research introducing a computational Turing test that reveals systematic differences between human and AI language.
Researchers at the University of Zurich, the University of Amsterdam, Duke University and New York University systematically compared nine open-weight LLMs across five calibration strategies, including fine-tuning, stylistic prompting and context retrieval, benchmarking their ability to reproduce user interactions on X (formerly Twitter), Bluesky and Reddit. The findings were published as a preprint on arXiv.
Even after calibration, LLM outputs remain clearly distinguishable from human text, particularly in affective tone and emotional expression. Instruction-tuned models underperform their base counterparts, and scaling up model size does not enhance human-likeness.
The researchers developed a computational Turing test, a validation framework that integrates aggregate metrics, including BERT-based detectability and semantic similarity, with interpretable linguistic features, such as stylistic markers and topical patterns, to assess how closely LLMs approximate human language within a given dataset.
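A minimal sketch gives a sense of the two aggregate metrics. The embedding model, function names and logistic-regression probe below are illustrative assumptions, not the paper's implementation, which fine-tunes a BERT classifier and reports its accuracy directly.

```python
# A minimal sketch, assuming sentence-transformers and scikit-learn are available;
# the embedding model and the logistic-regression probe are illustrative stand-ins
# for the paper's fine-tuned BERT detectability classifier.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import cross_val_score

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf encoder

def detectability(human_texts, ai_texts):
    """Cross-validated accuracy of a human-vs-AI probe (0.5 means indistinguishable)."""
    X = encoder.encode(human_texts + ai_texts)
    y = np.array([0] * len(human_texts) + [1] * len(ai_texts))
    return cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

def semantic_fidelity(ai_texts, human_references):
    """Mean cosine similarity between each AI reply and its paired human reference."""
    a = encoder.encode(ai_texts)
    h = encoder.encode(human_references)
    return float(np.mean(np.diag(cosine_similarity(a, h))))
```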
Emotional and affective cues persist
Contrary to common assumptions, the findings indicate that even the most advanced models and calibration methods remain readily distinguishable from human text. Whilst optimisation can mitigate some structural differences, deeper emotional and affective cues persist as reliable discriminators.
Some sophisticated strategies, such as fine-tuning and providing persona descriptions, fail to improve realism and can even make text more detectable, whereas simple stylistic examples and context retrieval yield modest gains. Non-instruction-tuned models such as Llama-3.1-8B, Mistral-7B and Apertus-8B outperform their instruction-tuned counterparts.
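The prompting-based strategies are simple to sketch. The template wording and helper name below are hypothetical, intended only to illustrate how persona descriptions, stylistic examples and retrieved context might be combined, not to reproduce the paper's exact prompts.

```python
# A minimal sketch, assuming the calibration ingredients named above (persona
# description, stylistic examples, retrieved context); the wording and the
# function name are illustrative, not the paper's templates.
def build_reply_prompt(post, style_examples=None, retrieved_context=None, persona=None):
    """Assemble a reply-generation prompt from optional calibration ingredients."""
    parts = []
    if persona:
        parts.append(f"You are the following social media user: {persona}")
    if style_examples:
        joined = "\n".join(f"- {ex}" for ex in style_examples)
        parts.append(f"Examples of how this user writes:\n{joined}")
    if retrieved_context:
        joined = "\n".join(f"- {c}" for c in retrieved_context)
        parts.append(f"Earlier posts from the same conversation:\n{joined}")
    parts.append(f"Write a short reply to this post:\n{post}")
    return "\n\n".join(parts)
```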
Most notably, the researchers identify a fundamental trade-off between realism and meaning: optimising for reduced detectability lowers semantic fidelity, whilst optimising for semantic accuracy does not consistently improve human-likeness.
Across platforms, affective and social language categories emerge as consistent discriminators. Toxicity score in particular is a critical discriminator on all three platforms, appearing among the top features for nearly all models. The research found that LLMs struggle to replicate the affective tone and social-relational language characteristic of human social media discourse.
The study assessed three dimensions: detectability, measuring how easily human and AI text can be distinguished; semantic fidelity, quantifying similarity in meaning to human reference replies; and interpretable linguistic analysis, identifying the stylistic and topical features that reveal AI authorship.
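The interpretable side of the analysis can likewise be sketched with a handful of hand-crafted markers. The feature set below is an illustrative assumption; the paper draws on a richer inventory that includes toxicity and affective or social word categories.

```python
# A minimal sketch of interpretable stylistic features, assuming a small hand-picked
# feature set; the paper's inventory is richer (toxicity, affective and social
# categories, topical patterns).
import statistics

def stylistic_features(text):
    """A few simple stylistic markers for one reply."""
    words = text.split()
    return {
        "length_words": len(words),
        "avg_word_length": statistics.mean(len(w) for w in words) if words else 0.0,
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        "first_person": sum(w.lower().strip(".,!?") in {"i", "me", "my", "we", "our"} for w in words),
        "uppercase_ratio": sum(c.isupper() for c in text) / max(len(text), 1),
    }

def feature_gaps(human_texts, ai_texts):
    """Mean human-minus-AI difference per feature; larger gaps flag likely discriminators."""
    def means(texts):
        rows = [stylistic_features(t) for t in texts]
        return {k: statistics.mean(r[k] for r in rows) for k in rows[0]}
    h, a = means(human_texts), means(ai_texts)
    return {k: h[k] - a[k] for k in h}
```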
Platform-specific patterns emerged, with content on X (formerly Twitter) exhibiting the lowest detectability, followed by Bluesky and then Reddit, where conversational norms are more diverse.