Adding a single sentence to prompts makes AI models generate responses up to 2.1 times more diverse without sacrificing quality, mitigating a problem known as mode collapse that limits creative output.
Research published on arXiv introduced Verbalized Sampling, a training-free prompting method that instructs models to generate probability distributions over candidate responses. The method works by adding phrases such as “Generate five responses with their corresponding probabilities, sampled from the full distribution” to standard prompts.
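In practice, the instruction can simply be appended to an ordinary prompt before it is sent to a model. The sketch below is illustrative only: it uses the standard OpenAI Python client rather than the authors' released package, and the task text and model name are assumptions for demonstration.

```python
# Illustrative sketch: Verbalized Sampling as a plain prompt suffix.
# Uses the standard OpenAI Python client; the task and model name are
# assumptions for demonstration, not the authors' released package.
from openai import OpenAI

client = OpenAI()

task = "Write an opening line for a short story titled 'Without a goodbye'."
vs_suffix = (
    "Generate five responses with their corresponding probabilities, "
    "sampled from the full distribution."
)

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": f"{task}\n\n{vs_suffix}"}],
)

# The model's reply verbalises several candidate responses with probabilities,
# which the caller can then parse or sample from.
print(response.choices[0].message.content)
```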
Researchers from Northeastern University, Stanford University and West Virginia University identified typicality bias in preference data as the fundamental cause of mode collapse. Human annotators systematically favour conventional text due to cognitive tendencies, leading aligned models to prioritise typical responses even when many high-quality options exist.
The team analysed more than 70,000 social media posts from US senators during 2018 and multiple preference datasets to verify the bias. They found that human raters favoured responses that were more typical of base models' output, independently of correctness.
“LLMs’ potentials are not fully unlocked yet,” said Weiyan Shi, an assistant professor at Northeastern University. “As shown in our paper, prompt optimisation can be guided by thinking about how LLMs are trained and aligned, and can be proved theoretically.”
Comprehensive experiments showed that Verbalized Sampling significantly improved performance across creative writing, dialogue simulation, open-ended question answering and synthetic data generation. In creative writing, the method increased diversity by 1.6-2.1 times over direct prompting and improved human evaluation scores by 25.7 per cent.
For story generation using the prompt “Without a goodbye”, direct prompting produced formulaic breakup scenes while Verbalized Sampling yielded narratives involving cosmic events, silent emails and music stopping mid-dance. The method recovered 66.8 per cent of the base model’s original diversity after alignment training, compared to just 23.8 per cent retention with direct prompting.
The researchers tested the method on models including GPT-4.1, Gemini-2.5-Pro, Claude-4-Sonnet and Llama-3.1-70B-Instruct. It proved model-agnostic and required no access to model internals or additional training.
Larger models showed greater gains from Verbalized Sampling, with diversity improvements 1.5 to 2 times larger than those seen in smaller models. For synthetic data generation, the method improved downstream performance on mathematics tasks, with fine-tuned models achieving 37.5 per cent average accuracy compared to 30.6 per cent using direct prompting.
The method allows users to tune diversity by adjusting probability thresholds in the prompt without changing decoding parameters. The researchers released the method as a Python package with integration for LangChain under an Apache 2.0 licence.
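Because the diversity control lives in the prompt text rather than in decoding settings, a threshold can be expressed as an extra sentence. The sketch below paraphrases that idea; the wording of the threshold instruction and the released package's actual API may differ.

```python
# Illustrative sketch of tuning diversity via a probability threshold written
# into the prompt itself (no decoding parameters changed). The wording is a
# paraphrase for demonstration and is not taken from the released package.
def verbalized_sampling_prompt(task: str, k: int = 5, max_prob: float = 0.10) -> str:
    return (
        f"{task}\n\n"
        f"Generate {k} responses with their corresponding probabilities, "
        f"sampled from the full distribution. "
        f"Each response should have a probability of at most {max_prob}."
    )

# Lowering max_prob asks the model for lower-probability, more atypical
# responses, increasing diversity without touching temperature or top-p.
prompt = verbalized_sampling_prompt("Write a short story titled 'Without a goodbye'.")
print(prompt)
```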