Researchers from Mila, McGill University and Microsoft Research have tackled a significant bottleneck in training AI reasoning models, cutting computational costs by roughly three-quarters with a technique called Markovian Thinking, which breaks reasoning into manageable chunks rather than processing everything at once.
The technique addresses an expensive problem. When AI models reason through complex tasks, they typically build up enormous chains of thought that can stretch to tens of thousands of tokens. Every new token forces the model to attend over everything that came before, so computational cost grows quadratically with the length of the chain.
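A toy calculation makes the quadratic blow-up concrete. This is a deliberate simplification: it counts only pairwise attention lookups during generation and ignores everything else (including any summary tokens carried between chunks), so it illustrates the scaling, not real hardware costs.

```python
# Toy model of generation cost: with a growing context, producing
# token t attends to all t previous tokens, so total attention work
# over an n-token chain grows like n^2 / 2.

def attention_ops(total_tokens: int) -> int:
    """Total pairwise attention lookups for one long generation."""
    return sum(range(total_tokens))  # 0 + 1 + ... + (n-1) ~ n^2 / 2

full = attention_ops(24_000)       # one continuous 24k-token chain
chunked = attention_ops(8_000) * 3  # the same 24k tokens in three 8k chunks

print(full)     # -> 287988000
print(chunked)  # -> 95988000, about a third of the work
```

Splitting the chain into fixed-size chunks turns the quadratic term into a linear one: each extra chunk adds a constant amount of attention work rather than an ever-growing one.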
Markovian Thinking takes a different approach by restructuring the reinforcement learning environment so the AI maintains a constant-size state regardless of how long it thinks. The researchers instantiated this paradigm with Delethink, which organises reasoning into fixed chunks of 8,000 tokens. At each boundary, the system resets, and the model must write a summary of its progress to carry forward. Think of it as a student solving an extended maths problem by working through it page by page, jotting down key findings at the end of each page rather than constantly rereading from the beginning.
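The page-by-page loop can be sketched in a few lines of Python. Everything here is a hypothetical illustration rather than the paper's implementation: `generate` stands in for any LLM call, and the prompt wording and `FINAL ANSWER:` marker are invented for the example. The key property is that only a bounded carry-forward string survives each chunk boundary.

```python
# Hypothetical sketch of a Delethink-style chunked reasoning loop.
# The model reasons in fixed-size chunks; at each boundary the context
# is reset and only the model's own written state is carried forward.

CHUNK_TOKENS = 8_000  # fixed chunk size used in the paper

def markovian_reasoning(generate, question: str, max_chunks: int = 12) -> str:
    carry = ""  # constant-size textual state across chunk boundaries
    for _ in range(max_chunks):
        prompt = (
            f"{question}\n\n"
            f"Progress so far:\n{carry}\n\n"
            "Continue reasoning. End with 'FINAL ANSWER:' when done."
        )
        chunk = generate(prompt, max_tokens=CHUNK_TOKENS)
        if "FINAL ANSWER:" in chunk:
            return chunk.split("FINAL ANSWER:")[-1].strip()
        # Context resets here: only this chunk's closing summary survives.
        carry = chunk
    return carry
```

Because the prompt never grows beyond one question plus one carry-forward state, per-chunk compute stays constant no matter how many chunks the model thinks for.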
Striking results
A 1.5 billion parameter model trained with Delethink can reason through 24,000 tokens whilst only ever processing 8,000 at a time, matching or beating traditional methods that process all 24,000 tokens continuously. The computational savings are substantial: training a model to handle 94,000-token reasoning chains requires 27 months of H100 GPU time with standard approaches, compared with just seven months using Delethink.
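A back-of-envelope comparison shows where the saving comes from, under a stated simplification: full-context training does attention work roughly proportional to the square of the trace length, whilst 8,000-token chunks cap each token's context. Real training cost includes many length-independent terms, which is why the reported gap is about 4x (27 months versus seven) rather than the larger attention-only ratio below.

```python
# Attention-only scaling comparison for a 94k-token reasoning trace.
# These are illustrative proportionality constants, not measured costs.

n, chunk = 94_000, 8_000
full_context = n * n   # every token can attend to the whole trace
chunked = n * chunk    # each token attends to at most one 8k chunk

print(full_context / chunked)  # -> 11.75, attention work alone
```

The direction of the effect is what matters: chunking makes training cost linear in thinking length, so longer reasoning no longer gets quadratically more expensive.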
The system keeps improving where traditional methods plateau. Researchers pushed one model to reason through 96,000 tokens, reaching 49 per cent accuracy on 2024’s notoriously difficult AIME mathematics competition with solutions averaging 36,000 tokens long.
Perhaps most surprisingly, existing AI models already know how to think this way. Analysis shows reasoning models from 1.5 billion to 120 billion parameters naturally produce these Markovian traces without special training. A 120 billion parameter model demonstrated robust Markovian thinking across PhD-level questions, coding challenges, mathematics competitions and crossword puzzles.
The research, led by Milad Aghajohari, Kamran Chitsaz and Amirhossein Kazemnejad, demonstrates “that decoupling thinking length from context size can, in principle, let next-generation reasoning models think for millions of tokens”. The researchers describe the reinforcement learning environment, often treated as fixed, as “a powerful lever for progress”.
The technique works alongside other efficiency methods, making it immediately practical for existing AI infrastructure without requiring architectural changes.