OpenAI has introduced GDPval, a new evaluation system measuring artificial intelligence model performance against human professionals across 44 occupations, with results suggesting frontier AI models are approaching expert-level capabilities in economically valuable work.
GPT-5 achieved a 40.6 per cent win rate against industry experts, whilst Anthropic’s Claude Opus 4.1 scored highest at 49 per cent across tasks spanning the healthcare, finance, manufacturing, and government sectors, reports TechCrunch.
The benchmark represents a significant advancement from academic tests toward real-world professional evaluation. OpenAI’s GDPval covers nine industries contributing most to US gross domestic product, testing models on 1,320 specialised tasks crafted by professionals averaging 14 years of experience in their respective fields.
Unlike traditional AI benchmarks, GDPval tasks include reference files and context, with expected deliverables spanning documents, slides, diagrams, spreadsheets, and multimedia content. Tasks range from legal briefs and engineering blueprints to customer support conversations and nursing care plans, reflecting actual workplace responsibilities.
OpenAI’s evaluation process employs expert graders who blindly compare AI-generated outputs with human-produced work across all 44 occupations, from software developers and lawyers to registered nurses and mechanical engineers. The company then averages each model’s win rate against the human experts’ deliverables to produce a single performance metric.
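The mechanics of such a pairwise metric are straightforward to illustrate. The Python sketch below is a hypothetical reconstruction, not OpenAI’s actual pipeline: the record format, field names, and helper functions (`blind_pair`, `win_rate`) are all assumed for illustration.

```python
import random

# Hypothetical grading records. Each entry is one blind pairwise comparison;
# the field names are illustrative, not OpenAI's actual schema.
grades = [
    {"occupation": "lawyer", "verdict": "model"},  # grader preferred the AI deliverable
    {"occupation": "lawyer", "verdict": "human"},  # grader preferred the expert's work
    {"occupation": "nurse",  "verdict": "tie"},
]

def blind_pair(model_output: str, human_output: str) -> list[str]:
    """Shuffle the two deliverables so the grader cannot tell which is which."""
    pair = [model_output, human_output]
    random.shuffle(pair)
    return pair

def win_rate(records: list[dict], count_ties: bool = False) -> float:
    """Fraction of graded comparisons the model won, optionally counting ties."""
    favourable = sum(
        1 for r in records
        if r["verdict"] == "model" or (count_ties and r["verdict"] == "tie")
    )
    return favourable / len(records)

# A grader would be shown a pair like this without labels.
print(blind_pair("AI-drafted legal brief", "expert-drafted legal brief"))

print(f"win rate: {win_rate(grades):.1%}")                       # wins only
print(f"wins or ties: {win_rate(grades, count_ties=True):.1%}")  # wins plus ties
```

Whether ties count as favourable noticeably shifts the headline number, which is why reported figures sometimes quote “wins and ties” rather than outright wins.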
Progress appears substantial across OpenAI’s model iterations. GPT-4o, released approximately 15 months earlier, achieved only 13.7 per cent wins and ties against human experts, meaning GPT-5’s performance represents a nearly threefold improvement (13.7 per cent × 3 ≈ 41 per cent) over that period.
“[Because] the model is getting good at some of these things,” OpenAI chief economist Dr Aaron Chatterji told TechCrunch, “people in those jobs can now use the model, increasingly as capabilities get better, to offload some of their work and do potentially higher value things.”
OpenAI acknowledges current limitations, noting that GDPval represents “an early step that doesn’t reflect the full nuance of many economic tasks.” The evaluation focuses on one-shot assessments rather than interactive workflows requiring context building or multiple drafts, which characterise much real-world professional work.
The company tested multiple frontier models, including GPT-4o, o4-mini, OpenAI o3, GPT-5, Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4. OpenAI attributes Claude’s strong performance partly to superior aesthetics in document formatting and slide layout, whilst crediting GPT-5 for its accuracy on domain-specific knowledge.
Future GDPval versions will expand to include more occupations, industries, and interactive task types, with OpenAI planning to better measure progress across diverse knowledge work that requires navigating ambiguity and iterating on drafts.