Reduced LLM memories.
Photo credit: theFreesheet/Google ImageFX

Researchers at Seoul National University have developed AI technology that compresses the “conversation memory” of large language model-based chatbots three- to four-fold whilst roughly doubling response speed, all without any loss of accuracy.

The new method, called KVzip, selectively retains only information useful for future questions, autonomously verifying and compressing memory for efficient reuse in long-context tasks such as extended dialogue and document summarisation. The findings were selected as an Oral Presentation, representing the top 0.35 per cent among 21,575 submissions to NeurIPS 2025, one of the world’s most prestigious conferences in artificial intelligence.

The term “conversation memory” refers to the temporary storage of sentences, questions and responses that a chatbot maintains during interaction, which it uses to generate contextually coherent replies. Using KVzip, a chatbot can compress this memory by eliminating redundant or unnecessary information that is not essential for reconstructing context.

“KVzip is significant in that it enables reusable compressed memory that retains only the most essential information, even in LLM agents requiring long contextual understanding,” stated Professor Hyun Oh Song from the Department of Computer Science and Engineering, who led the research team.

Dialogue, coding and question answering

Modern LLM chatbots perform tasks such as dialogue, coding and question answering using enormous contexts that can span hundreds or even thousands of pages. As conversations grow longer, however, the accumulated conversation memory increases computational cost and slows down response time.
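To see why this matters, a back-of-the-envelope calculation helps. The figures below are our own illustration rather than numbers from the paper, assuming the published Llama 3.1 8B configuration (32 layers, 8 grouped-query KV heads of dimension 128, 16-bit values):

```python
# Rough KV cache footprint, assuming Llama 3.1 8B's published architecture
# (32 layers, 8 grouped-query KV heads, head dimension 128, 16-bit values).
layers, kv_heads, head_dim, bytes_per_value = 32, 8, 128, 2

# Each token stores one key and one value vector per KV head in every layer.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 128 KiB

context_tokens = 170_000  # the longest context reported for KVzip
cache_gib = bytes_per_token * context_tokens / 2**30
print(f"Full cache: {cache_gib:.1f} GiB")       # ~20.8 GiB
print(f"After 4x:   {cache_gib / 4:.1f} GiB")   # ~5.2 GiB
```

At that scale, the cache alone rivals the memory needed for the model weights themselves, which is why a three- to four-fold reduction translates directly into more concurrent users per server.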

Most existing compression techniques are query-dependent, meaning they optimise memory only for the current question. When a new or follow-up question is asked, the chatbot’s performance typically deteriorates significantly. KVzip overcomes this limitation by performing compression that retains only the information necessary for context reconstruction, enabling the chatbot to handle multiple future queries without needing to recompress its memory each time.
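In rough terms, the idea can be sketched as follows. The toy Python snippet below is our illustration, not the authors' released code: it assumes each cached key-value entry is scored by the maximum attention it receives while the model reconstructs its own context (the query-agnostic criterion described above), with random arrays standing in for a real model's cache and attention weights.

```python
import numpy as np

def score_kv_entries(attn_reconstruction):
    # attn_reconstruction: [reconstruction_steps, cache_len] attention weights.
    # Each cached entry's importance is the maximum attention it ever receives
    # while the model "repeats" the original context from its cache.
    return attn_reconstruction.max(axis=0)

def compress_cache(keys, values, scores, keep_ratio=0.25):
    # Query-independent eviction: keep the top-scoring fraction of entries once,
    # then reuse the same compressed cache for all future questions.
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])  # preserve positional order
    return keys[keep], values[keep], keep

# Toy demo: random arrays stand in for a real model's KV cache and attention.
rng = np.random.default_rng(0)
cache_len, head_dim = 1000, 64
keys = rng.normal(size=(cache_len, head_dim))
values = rng.normal(size=(cache_len, head_dim))
attn = rng.dirichlet(np.ones(cache_len), size=200)  # 200 reconstruction steps
scores = score_kv_entries(attn)
k2, v2, kept = compress_cache(keys, values, scores, keep_ratio=0.25)
print(f"kept {len(kept)}/{cache_len} entries")  # a ~4x smaller cache
```

The crucial design choice is that scoring happens once, against the context itself rather than against any particular question, so the pruned cache remains valid for whatever the user asks next.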

In a wide range of tasks, including question answering, retrieval, reasoning and code understanding, KVzip achieved a three- to four-fold memory reduction and roughly a two-fold speed-up in response time, without any loss in accuracy. The technique also scaled to extremely long contexts of up to 170,000 tokens on major open-source LLMs such as Llama 3.1, Qwen 2.5 and Gemma 3.

“KVzip can be seamlessly applied to real-world LLM applications and on-device systems to ensure consistent quality and improved speed for long-context interactions,” stated Dr Jang-Hyun Kim, the project’s main contributor, who will join the AI/ML Foundation Models team at Apple as a machine learning researcher.

The technology has been integrated into NVIDIA’s open-source KV cache compression library, KVPress, making it readily accessible for practical deployment. By reducing memory usage three- to four-fold and roughly halving response latency, the method allows servers to handle more concurrent users and longer conversations whilst significantly lowering operating costs.
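For developers, usage follows KVPress's standard pipeline pattern. The sketch below is illustrative, based on the library's documented interface; ExpectedAttentionPress is used purely as a stand-in, and the exact press class exposing KVzip (and current model support) should be confirmed against the KVPress repository.

```python
# Sketch of KVPress's documented pipeline pattern; ExpectedAttentionPress is a
# stand-in press that ships with the library. Check the KVPress repository for
# the press class that implements KVzip.
from transformers import pipeline
from kvpress import ExpectedAttentionPress

pipe = pipeline(
    "kv-press-text-generation",  # task registered when kvpress is installed
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # example model
    device_map="auto",
    torch_dtype="auto",
)

context = "..."   # a long document or dialogue history (elided here)
question = "..."  # asked against the already-compressed cache

press = ExpectedAttentionPress(compression_ratio=0.75)  # evict ~75% of entries
answer = pipe(context, question=question, press=press)["answer"]
print(answer)
```

A compression ratio of 0.67 to 0.75, meaning a quarter to a third of the cache is kept, corresponds to the three- to four-fold reduction reported in the article.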
