A new open source AI model from Chinese startup Moonshot AI has outperformed OpenAI’s proprietary GPT-5 and Anthropic’s Claude Sonnet 4.5 on several standard evaluations, marking an inflection point for the competitiveness of open AI systems and putting pressure on U.S. proprietary AI firms to justify their massive investments.
The Kimi K2 Thinking model, released today, has surpassed both proprietary and open-source competitors to claim the top position in reasoning, coding, and agentic-tool benchmarks, reports VentureBeat.
Despite being fully open-source, the model now outperforms OpenAI’s GPT-5, Anthropic’s Claude Sonnet 4.5 (Thinking mode), and xAI’s Grok-4 on several standard evaluations. Developers can access the model via platform.moonshot.ai and kimi.com, whilst weights and code are hosted on Hugging Face.
Kimi K2 Thinking is a Mixture-of-Experts model with one trillion total parameters, of which 32 billion are active per inference. It combines long-horizon reasoning with structured tool use, executing 200 to 300 sequential tool calls without human intervention.
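A quick back-of-the-envelope calculation makes the sparse-activation economics concrete. This sketch uses only the two figures quoted above (one trillion total parameters, 32 billion active per inference); it is illustrative, not Moonshot's own accounting.

```python
# Back-of-the-envelope: what fraction of Kimi K2 Thinking's weights
# participate in a single forward pass? Figures are the published
# specs quoted above: 1T total parameters, 32B active per inference.
TOTAL_PARAMS = 1_000_000_000_000   # one trillion parameters in total
ACTIVE_PARAMS = 32_000_000_000     # 32 billion active per inference

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active per inference: {active_fraction:.1%} of all weights")
# → Active per inference: 3.2% of all weights
```

In other words, each token is processed by roughly one thirtieth of the network, which is what lets a trillion-parameter model run at the serving cost of a much smaller dense one.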
Benchmark results
According to Moonshot’s published test results, K2 Thinking achieved a state-of-the-art 44.9 per cent on Humanity’s Last Exam; 60.2 per cent on BrowseComp, an agentic web-search and reasoning test; and 71.3 per cent on SWE-Bench Verified and 83.1 per cent on LiveCodeBench v6, two key coding evaluations.
On BrowseComp, the open model’s 60.2 per cent decisively leads GPT-5’s 54.9 per cent and Claude 4.5’s 24.1 per cent. K2 Thinking also edges GPT-5 in GPQA Diamond at 85.7 per cent versus 84.5 per cent and matches it on mathematical reasoning tasks such as AIME 2025 and HMMT 2025.
Moonshot AI has formally released Kimi K2 Thinking under a Modified MIT License on Hugging Face. The licence grants full commercial and derivative rights but adds one restriction: if the software or any derivative product serves over 100 million monthly active users or generates over $20 million USD per month in revenue, the deployer must prominently display “Kimi K2” on the product’s user interface.
Despite its trillion-parameter scale, K2 Thinking’s runtime cost remains modest: $0.15 per 1 million input tokens on a cache hit, $0.60 per 1 million input tokens on a cache miss and $2.50 per 1 million output tokens. These rates substantially undercut GPT-5’s $1.25 per million input tokens and $10 per million output tokens.
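The pricing gap is easy to quantify. A minimal sketch using the per-million-token rates quoted above; the monthly workload size is an illustrative assumption, and cache-hit discounts are ignored for simplicity:

```python
# Cost comparison for a hypothetical workload, at the published
# per-million-token rates quoted above. The workload volumes are
# illustrative assumptions; cache-hit discounts are ignored.
RATES = {
    "Kimi K2 Thinking": {"input": 0.60, "output": 2.50},  # cache-miss input rate
    "GPT-5":            {"input": 1.25, "output": 10.00},
}

def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a workload at the model's per-million-token rates."""
    r = RATES[model]
    return (input_tokens * r["input"] + output_tokens * r["output"]) / 1_000_000

# Hypothetical month: 100 million input tokens, 20 million output tokens.
for model in RATES:
    print(f"{model}: ${workload_cost(model, 100_000_000, 20_000_000):,.2f}")
```

At these assumed volumes K2 Thinking comes out to $110 against GPT-5’s $325, roughly a third of the cost; heavy cache-hit traffic at the $0.15 rate would widen the gap further.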
The arrival of K2 Thinking comes amid growing scrutiny of the financial sustainability of AI’s largest players. OpenAI CFO Sarah Friar sparked controversy after suggesting at the WSJ Tech Live event that the U.S. government might eventually need to provide a “backstop” for the company’s more than $1.4 trillion in compute and data-centre commitments.
For enterprises, the developments suggest that high-end AI capability is no longer synonymous with high-end capital expenditure. If enterprise customers can get comparable or better performance from a free, open-source Chinese AI model than from paid, proprietary alternatives, U.S. firms face mounting pressure to justify their pricing and continued investment.