Reduced LLM memories.
Photo credit: theFreesheet/Google ImageFX

Researchers at Seoul National University have developed AI technology that compresses the “conversation memory” of large language model-based chatbots by three to four times whilst achieving approximately two times faster response times, all without any loss in accuracy.

The new method, called KVzip, selectively retains only information useful for future questions, autonomously verifying and compressing memory for efficient reuse in long-context tasks such as extended dialogue and document summarisation. The findings were selected as an Oral Presentation, representing the top 0.35 per cent among 21,575 submissions to NeurIPS 2025, one of the world’s most prestigious conferences in artificial intelligence.

The term “conversation memory” refers to the temporary storage of sentences, questions and responses that a chatbot maintains during interaction, which it uses to generate contextually coherent replies. Using KVzip, a chatbot can compress this memory by eliminating redundant or unnecessary information that is not essential for reconstructing context.

“KVzip is significant in that it enables reusable compressed memory that retains only the most essential information, even in LLM agents requiring long contextual understanding,” stated Professor Hyun Oh Song from the Department of Computer Science and Engineering, who led the research team.

Dialogue, coding and question answering

Modern LLM chatbots perform tasks such as dialogue, coding and question answering using enormous contexts that can span hundreds or even thousands of pages. As conversations grow longer, however, the accumulated conversation memory increases computational cost and slows down response time.

Most existing compression techniques are query-dependent, meaning they optimise memory only for the current question. When a new or follow-up question is asked, the chatbot’s performance typically deteriorates significantly. KVzip overcomes this limitation by performing compression that retains only the information necessary for context reconstruction, enabling the chatbot to handle multiple future queries without needing to recompress its memory each time.
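The difference between query-dependent and query-agnostic compression can be illustrated with a toy sketch. It is not KVzip's actual scoring procedure, which the article does not detail: the `scores` list and the `compress_cache` helper are hypothetical stand-ins for whatever query-independent importance measure a method assigns to each cached token. The point is only that the selection is made once, without looking at any particular question, so the same compressed memory can serve every later query.

```python
from typing import List, Sequence


def compress_cache(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return the sorted positions of cache entries to retain.

    scores[i] is a hypothetical, query-independent importance score for the
    i-th cached token (higher means more useful for rebuilding the context).
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(scores) * keep_ratio))
    # Rank positions by score, highest first, and keep the top n_keep.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Made-up scores for an 8-token context (illustrative numbers only).
scores = [0.9, 0.1, 0.8, 0.05, 0.7, 0.2, 0.6, 0.15]

# Keep a quarter of the entries, i.e. a four-times compression.
kept = compress_cache(scores, keep_ratio=0.25)
print(kept)
```

Because `kept` is computed without reference to a question, a follow-up query reuses the same retained positions. A query-dependent scheme would instead have to recompute the scores, and possibly the whole selection, each time the question changes.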

Across a wide range of tasks, including question answering, retrieval, reasoning and code understanding, KVzip achieved a three to four times memory reduction and approximately two times faster response times without any loss in accuracy. The technique also demonstrated scalability to extremely long contexts of up to 170,000 tokens using major open-source LLMs such as Llama 3.1, Qwen 2.5 and Gemma 3.

“KVzip can be seamlessly applied to real-world LLM applications and on-device systems to ensure consistent quality and improved speed for long-context interactions,” stated Dr Jang-Hyun Kim, the main contributor of the project who will join the AI/ML Foundation Models team at Apple as a machine learning researcher.

The technology has been integrated into NVIDIA’s open-source KV cache compression library, KVPress, making it readily accessible for practical deployment. By cutting memory usage to between a third and a quarter of its original size and roughly halving response latency, the method allows servers to handle more concurrent users and longer conversations whilst significantly lowering operating costs.
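To give a sense of scale, the back-of-the-envelope calculation below estimates the size of the conversation memory (the KV cache) for a 170,000-token context. The model shape used here (32 layers, 8 key-value heads, head dimension 128, 16-bit values) is an assumption, roughly matching an 8-billion-parameter Llama 3.1 model; the article does not state which model sizes were tested, and `kv_cache_bytes` is a hypothetical helper written for this illustration.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Estimate KV cache size in bytes for a single sequence."""
    # Factor of 2: one key vector and one value vector are stored
    # per token, per layer, per key-value head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 1024 ** 3

# Assumed model shape (not taken from the article).
full = kv_cache_bytes(tokens=170_000, layers=32, kv_heads=8, head_dim=128)

print(f"uncompressed: {full / GIB:.1f} GiB")      # about 20.8 GiB
print(f"3x compressed: {full / 3 / GIB:.1f} GiB")  # about 6.9 GiB
print(f"4x compressed: {full / 4 / GIB:.1f} GiB")  # about 5.2 GiB
```

Under these assumptions, a single long conversation that would otherwise occupy around 20 GiB of accelerator memory shrinks to roughly 5 to 7 GiB, which is why the same hardware could hold several such conversations at once.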
