Recognising cats with AI
Photo credit: MIT

Researchers from MIT and the MIT-IBM Watson AI Lab have developed a training method that teaches vision-language models to locate personalised objects, addressing a fundamental weakness in current AI systems.

Vision-language models like GPT-5 excel at recognising general objects such as dogs, but perform poorly when asked to identify a specific pet among other animals. The new technique uses video-tracking data showing the same object across multiple frames, forcing models to rely on contextual clues rather than memorised knowledge.

The researchers structured their dataset so that models must focus on visual context to identify personalised objects. They found that models tried to sidestep this challenge by falling back on pretrained knowledge, which prompted the team to use pseudo-names in the training data. Instead of labelling a tiger as “tiger”, they used names like “Charlie” to prevent the model from relying on previously learned associations.
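The pseudo-name idea can be illustrated with a small Python sketch. This is not the researchers' code: the `Frame` record, the `PSEUDO_NAMES` pool, and the `assign_pseudo_names` and `build_example` helpers are all hypothetical, and the real dataset format is not described in the article. The sketch only assumes that each tracked object appears in several frames with a bounding box and a category label.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical pool of neutral names; the article only mentions "Charlie".
PSEUDO_NAMES = ["Charlie", "Milo", "Robin", "Sasha"]


@dataclass
class Frame:
    image_id: str                    # identifier of one video frame
    box: Tuple[int, int, int, int]   # (x, y, width, height) of the tracked object
    label: str                       # category name, e.g. "tiger"


def assign_pseudo_names(tracks: Dict[str, List[Frame]]) -> Dict[str, List[Frame]]:
    """Replace each track's category label with one pseudo-name per track."""
    if len(tracks) > len(PSEUDO_NAMES):
        raise ValueError("not enough pseudo-names for the number of tracks")
    renamed = {}
    # Sort by track id so the name assignment is deterministic.
    for name, (track_id, frames) in zip(PSEUDO_NAMES, sorted(tracks.items())):
        renamed[track_id] = [Frame(f.image_id, f.box, name) for f in frames]
    return renamed


def build_example(frames: List[Frame]) -> dict:
    """Use all but the last frame as context and the last frame as the query."""
    *context, query = frames
    return {
        "context": [(f.image_id, f.label, f.box) for f in context],
        "question": f"Where is {query.label} in {query.image_id}?",
        "answer": query.box,
    }
```

Because the category word never appears in the resulting example, a model trained on it cannot answer from what it already knows about tigers and has to match the object against the context frames instead.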

“It took us a while to figure out how to prevent the model from cheating. But we changed the game for the model. The model does not know that ‘Charlie’ can be a tiger, so it is forced to look at the context,” says Jehanzeb Mirza, an MIT postdoc and senior author of a paper on this technique.

Models retrained with the new dataset improved accuracy at personalised localisation by approximately 12 per cent on average. When combined with pseudo-name labelling, performance gains reached 21 per cent. Larger models showed greater improvements.

The technique maintains the model’s general capabilities whilst adding personalised object recognition. Potential applications include tracking specific items over time, ecological monitoring of animal species, and assistive technologies for visually impaired users.

The research revealed an unexpected finding: vision-language models do not inherit in-context learning capabilities from their base large language models, despite theoretical expectations. The team plans to investigate why this capability fails to transfer between the two model types.

The work will be presented at the International Conference on Computer Vision.
