
OpenAI has introduced GDPval, a new evaluation system measuring artificial intelligence model performance against human professionals across 44 occupations, with results suggesting frontier AI models are approaching expert-level capabilities in economically valuable work.

GPT-5 was rated as good as or better than industry experts on 40.6 per cent of tasks, whilst Anthropic’s Claude Opus 4.1 scored highest at 49 per cent, across tasks spanning healthcare, finance, manufacturing, and government sectors, reports TechCrunch.

The benchmark represents a significant advancement from academic tests toward real-world professional evaluation. OpenAI’s GDPval covers nine industries contributing most to US gross domestic product, testing models on 1,320 specialised tasks crafted by professionals averaging 14 years of experience in their respective fields.

Unlike traditional AI benchmarks, GDPval tasks include reference files and context, with expected deliverables spanning documents, slides, diagrams, spreadsheets, and multimedia content. Tasks range from legal briefs and engineering blueprints to customer support conversations and nursing care plans, reflecting actual workplace responsibilities.

OpenAI’s evaluation process employs expert graders who blindly compare AI-generated outputs with human-produced work across all 44 occupations, from software developers and lawyers to registered nurses and mechanical engineers. The company then averages each model’s win rate against the human-produced work across those occupations to arrive at a headline score.
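The averaging step described above can be illustrated with a minimal sketch. It is not OpenAI’s actual code: the verdict data, the function name `win_or_tie_rate`, the decision to count ties alongside wins, and the choice to average per occupation before averaging overall are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical blind-grader verdicts, one per task, from the model's point
# of view. Occupations and outcomes here are invented for illustration.
verdicts = [
    ("registered nurse", "win"),
    ("registered nurse", "loss"),
    ("lawyer", "tie"),
    ("lawyer", "loss"),
    ("lawyer", "loss"),
    ("lawyer", "loss"),
]


def win_or_tie_rate(verdicts):
    """Return (per-occupation rates, average of those rates).

    Assumption: a "tie" counts the same as a "win", and each occupation
    carries equal weight regardless of how many tasks it has.
    """
    outcomes = defaultdict(list)
    for occupation, verdict in verdicts:
        outcomes[occupation].append(verdict in ("win", "tie"))

    # Share of tasks in each occupation where the model matched or beat the human.
    rates = {occ: sum(flags) / len(flags) for occ, flags in outcomes.items()}

    # Average across occupations, not across individual tasks.
    overall = sum(rates.values()) / len(rates)
    return rates, overall


rates, overall = win_or_tie_rate(verdicts)
print(rates)
print(overall)
```

In this toy data the per-occupation average differs from the share of all tasks won or tied, because the occupation with more tasks does not get extra weight.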

Progress appears substantial across OpenAI’s model iterations. GPT-4o, released approximately 15 months earlier, achieved only 13.7 per cent wins and ties against human experts, meaning GPT-5’s 40.6 per cent represents a nearly threefold improvement within this timeframe.

“[Because] the model is getting good at some of these things,” OpenAI chief economist Dr Aaron Chatterji told TechCrunch, “people in those jobs can now use the model, increasingly as capabilities get better, to offload some of their work and do potentially higher value things.”

OpenAI acknowledges current limitations, noting that GDPval represents “an early step that doesn’t reflect the full nuance of many economic tasks.” The evaluation focuses on one-shot assessments rather than the interactive workflows, with their context building and multiple drafts, that characterise much of real-world professional work.

The company tested multiple frontier models, including GPT-4o, o4-mini, OpenAI o3, GPT-5, Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4. OpenAI attributes Claude’s strong performance partly to superior aesthetics in document formatting and slide layout, whilst crediting GPT-5 with greater accuracy on domain-specific knowledge.

Future GDPval versions will expand to include more occupations, industries, and interactive task types, with OpenAI planning to better measure progress on knowledge work that involves navigating ambiguity and improving output over several iterations.
