MIT CSAIL creates realistic virtual kitchens and living rooms where simulated robots can interact with models of real-world objects
Photo credit: MIT CSAIL

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory and the Toyota Research Institute have developed a system that creates diverse, physically accurate virtual environments for training robots to handle household tasks.

The “steerable scene generation” approach produces digital scenes of kitchens, living rooms and restaurants where engineers can simulate real-world interactions and scenarios. Trained on over 44 million 3D rooms filled with models of objects such as tables and plates, the tool places existing assets in new scenes before refining each into a physically accurate environment.

The system addresses a fundamental challenge in robot training: whilst chatbots like ChatGPT learned from billions of text data points across the internet, robots need visual demonstrations showing how to handle, stack and place various arrangements of objects across diverse environments. Collecting those demonstrations on real robots is time-consuming and difficult to repeat exactly, whilst previous AI-generated simulations often failed to reflect real-world physics.

Steerable scene generation creates 3D worlds by “steering” a diffusion model, an AI system that generates visuals from random noise, toward scenes found in everyday life. The system “in-paints” an environment, filling in particular elements throughout the scene whilst ensuring physical accuracy, such as preventing a fork from passing through a bowl on a table.
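As a rough illustration of that in-painting idea, the sketch below samples candidate object poses and accepts only those that do not interpenetrate items already in the scene. The generator, the collision test and every name here are simplified stand-ins, not the authors' code or API.

```python
import random
from dataclasses import dataclass

# Hypothetical stand-ins: the real system uses a learned scene diffusion
# model and proper physics checks; here both are faked to show the control flow.

@dataclass
class Placement:
    name: str
    x: float
    y: float
    half_extent: float  # treat every asset as a square footprint on the table

def overlaps(a: Placement, b: Placement) -> bool:
    """Crude axis-aligned overlap test standing in for a collision check."""
    return (abs(a.x - b.x) < a.half_extent + b.half_extent and
            abs(a.y - b.y) < a.half_extent + b.half_extent)

def sample_placement(name: str) -> Placement:
    """Stand-in for drawing one object pose from the generative model."""
    return Placement(name, random.uniform(0, 1), random.uniform(0, 1), 0.05)

def inpaint_scene(existing, assets_to_add, max_tries=100):
    """Fill empty space with new assets, keeping only penetration-free poses."""
    scene = list(existing)
    for name in assets_to_add:
        for _ in range(max_tries):
            candidate = sample_placement(name)
            if not any(overlaps(candidate, other) for other in scene):
                scene.append(candidate)  # accept a physically plausible pose
                break
    return scene

table = [Placement("bowl", 0.5, 0.5, 0.1)]
print([p.name for p in inpaint_scene(table, ["fork", "plate", "cup"])])
```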

The tool’s main strategy employs Monte Carlo tree search (MCTS), where the model creates a series of alternative scenes and fills them out in different ways toward a particular objective. Nicholas Pfaff, a PhD student in MIT’s Department of Electrical Engineering and Computer Science and the paper’s lead author, explained: “We are the first to apply MCTS to scene generation by framing the scene generation task as a sequential decision-making process.” The approach builds on partial scenes to produce better results over time, creating scenes more complex than those the diffusion model was trained on.
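To make the sequential framing concrete, here is a compact, generic Monte Carlo tree search over partial scenes, where each action places one more object and a toy objective scores the finished arrangement. The asset list, objective and hyperparameters are all illustrative assumptions, not details from the paper.

```python
import math
import random

# Each tree node is a partial scene (a list of placed objects); each action
# adds one object; the objective scores completed scenes. Everything below is
# a hypothetical sketch of the sequential-decision framing, not the authors' code.

ASSETS = ["plate", "bowl", "cup", "fork", "dim_sum_steamer"]
MAX_OBJECTS = 6

def legal_actions(scene):
    return ASSETS if len(scene) < MAX_OBJECTS else []

def objective(scene):
    # Toy objective: reward many objects, lightly penalise duplicates.
    return len(scene) - 0.2 * (len(scene) - len(set(scene)))

class Node:
    def __init__(self, scene, parent=None):
        self.scene = scene
        self.parent = parent
        self.children = {}
        self.visits = 0
        self.value = 0.0

def ucb(child, parent_visits, c=1.4):
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(
        math.log(parent_visits) / child.visits)

def rollout(scene):
    scene = list(scene)
    while legal_actions(scene):          # randomly finish the partial scene
        scene.append(random.choice(legal_actions(scene)))
    return objective(scene)

def mcts(root_scene, iterations=500):
    root = Node(root_scene)
    for _ in range(iterations):
        node = root
        # Selection: descend fully expanded nodes by UCB score.
        while legal_actions(node.scene) and len(node.children) == len(legal_actions(node.scene)):
            node = max(node.children.values(), key=lambda ch: ucb(ch, node.visits))
        # Expansion: try one untried action, i.e. place one more object.
        untried = [a for a in legal_actions(node.scene) if a not in node.children]
        if untried:
            action = random.choice(untried)
            child = Node(node.scene + [action], parent=node)
            node.children[action] = child
            node = child
        # Simulation and backpropagation.
        reward = rollout(node.scene)
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    best = max(root.children.values(), key=lambda ch: ch.visits)
    return best.scene

scene = ["table"]
while legal_actions(scene):
    scene = mcts(scene)   # commit the most-visited next placement, then repeat
print(scene)
```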

In one experiment, the system added the maximum number of objects to a simple restaurant scene, fitting as many as 34 items on a table, including massive stacks of dim sum dishes, after training on scenes with only 17 objects on average.

The tool accurately followed users’ prompts at rates of 98 per cent when building scenes of pantry shelves and 86 per cent for messy breakfast tables. Both marks represent at least a 10 per cent improvement over comparable methods like MiDiffusion and DiffuScene.

Users can prompt the system directly by typing specific visual descriptions, or request it to complete specific scenes by filling in empty spaces whilst preserving the rest of an environment. The system can also generate diverse training scenarios via reinforcement learning, teaching a diffusion model to fulfil an objective through trial and error.
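The trial-and-error mode can be pictured with a toy loop like the one below, which nudges a simple placement policy toward samples that score well under an objective. The real system teaches a diffusion model; this stand-in only shows the reinforcement idea, and every name in it is hypothetical.

```python
import random

# A categorical "placement policy" over assets is reinforced toward choices
# that score well under an objective. Purely illustrative; not the real system.

ASSETS = ["plate", "bowl", "cup", "fork"]
weights = {a: 1.0 for a in ASSETS}            # unnormalised policy weights

def sample_scene(n_objects=5):
    total = sum(weights.values())
    probs = [weights[a] / total for a in ASSETS]
    return [random.choices(ASSETS, probs)[0] for _ in range(n_objects)]

def objective(scene):
    return scene.count("plate")               # e.g. "set the table with plates"

for step in range(200):                        # trial and error
    scene = sample_scene()
    reward = objective(scene)
    for obj in scene:                          # reinforce the sampled choices
        weights[obj] += 0.01 * reward

print(sorted(weights.items(), key=lambda kv: -kv[1]))
```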

Pfaff noted: “A key insight from our findings is that it’s OK for the scenes we pre-trained on to not exactly resemble the scenes that we actually want.” Using the steering methods, the team can move beyond that broad distribution to generate the diverse, realistic and task-aligned scenes actually needed for robot training.

The researchers acknowledge their work remains a proof of concept. Future plans include using generative AI to create entirely new objects and scenes rather than drawing from a fixed library of assets, and incorporating articulated objects that robots could open or twist.

Russ Tedrake, Toyota Professor of Electrical Engineering and Computer Science, Aeronautics and Astronautics, and Mechanical Engineering at MIT, served as senior author. The work was supported by Amazon and the Toyota Research Institute, with findings presented at the Conference on Robot Learning in September.
