Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory and the Toyota Research Institute have developed a system that creates diverse, physically accurate virtual environments for training robots to handle household tasks.
The “steerable scene generation” approach produces digital scenes of kitchens, living rooms and restaurants where engineers can simulate real-world interactions and scenarios. Trained on over 44 million 3D rooms filled with models of objects such as tables and plates, the tool places existing assets in new scenes before refining each into a physically accurate environment.
The system addresses a fundamental challenge in robot training: whilst chatbots like ChatGPT learned from billions of text data points across the internet, robots need visual demonstrations showing how to handle, stack and place various arrangements of objects across diverse environments. Collecting these demonstrations on real robots is time-consuming and not perfectly repeatable, whilst previous AI-generated simulations often failed to reflect real-world physics.
Steerable scene generation creates 3D worlds by “steering” a diffusion model, an AI system that generates visuals from random noise, toward scenes found in everyday life. The system “in-paints” an environment, filling in particular elements throughout the scene whilst ensuring physical accuracy, such as preventing a fork from passing through a bowl on a table.
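To make the in-painting idea concrete, here is a minimal toy sketch, not the researchers' code: a stand-in denoiser places one-dimensional object positions on a table, user-fixed objects are written back after every step (the in-painting constraint), and a crude separation pass keeps objects from interpenetrating. The denoiser, noise schedule and overlap rule are all invented for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "scene": x-coordinates of 5 objects on a 1 m table, each 10 cm wide.
N_OBJECTS, WIDTH, STEPS = 5, 0.10, 50
fixed_mask = np.array([True, False, False, False, True])   # objects the user keeps
fixed_vals = np.array([0.10, 0.0, 0.0, 0.0, 0.90])         # their known positions

def toy_denoiser(x, t):
    """Stand-in for a learned diffusion model: nudge positions toward an
    evenly spaced prior layout. A real model would predict this from data."""
    prior = np.linspace(0.1, 0.9, N_OBJECTS)
    return x + (prior - x) * (1.0 - t)   # pull harder as the noise level t falls

def project_no_overlap(x):
    """Crude physical-feasibility step: sort objects and push apart any pair
    closer than one object width, so nothing interpenetrates."""
    order = np.argsort(x)
    x = x.copy()
    for a, b in zip(order[:-1], order[1:]):
        if x[b] - x[a] < WIDTH:
            x[b] = x[a] + WIDTH
    return np.clip(x, 0.0, 1.0)

# Reverse diffusion with in-painting: start from noise, denoise the free
# objects, and overwrite the fixed ones with their known values every step.
x = rng.uniform(0.0, 1.0, N_OBJECTS)
for step in range(STEPS):
    t = 1.0 - step / STEPS                      # noise level goes from 1 to 0
    x = toy_denoiser(x, t) + 0.02 * t * rng.standard_normal(N_OBJECTS)
    x[fixed_mask] = fixed_vals[fixed_mask]      # in-painting constraint
    x = project_no_overlap(x)
    x[fixed_mask] = fixed_vals[fixed_mask]

print("final object positions (m):", np.round(x, 3))
```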
The tool’s main strategy employs Monte Carlo tree search (MCTS), in which the model builds a series of alternative scenes, filling them out with different arrangements as it works toward a particular objective. Nicholas Pfaff, a PhD student in MIT’s Department of Electrical Engineering and Computer Science and the paper’s lead author, explained: “We are the first to apply MCTS to scene generation by framing the scene generation task as a sequential decision-making process.” The approach builds on partial scenes to produce better results over time, creating scenes more complex than those the diffusion model was trained on.
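As a rough illustration of scene generation framed as sequential decision-making, the generic MCTS sketch below treats each node as a partial “scene” (occupied slots on a toy shelf), each action as adding one object, and the reward as the number of objects placed without conflict. The slot model, the conflict rule and the reward are assumptions made for this example; the actual system searches over full 3D scenes with richer objectives.

```python
import math, random

N_SLOTS = 10   # toy shelf with 10 slots; adjacent slots "collide"

def legal_actions(scene):
    """Slots that can still receive an object without touching a neighbour."""
    return [s for s in range(N_SLOTS)
            if s not in scene and s - 1 not in scene and s + 1 not in scene]

def rollout(scene):
    """Random playout: keep adding objects until nothing fits, return the count."""
    scene = set(scene)
    actions = legal_actions(scene)
    while actions:
        scene.add(random.choice(actions))
        actions = legal_actions(scene)
    return len(scene)

class Node:
    def __init__(self, scene, parent=None):
        self.scene = frozenset(scene)
        self.parent = parent
        self.children = {}            # action -> Node
        self.untried = legal_actions(self.scene)
        self.visits = 0
        self.value = 0.0              # running mean reward

    def ucb_child(self, c=1.4):
        return max(self.children.values(),
                   key=lambda n: n.value + c * math.sqrt(math.log(self.visits) / n.visits))

def mcts(iterations=2000):
    root = Node(frozenset())
    for _ in range(iterations):
        node = root
        # 1. Selection: descend fully expanded nodes by the UCB rule.
        while not node.untried and node.children:
            node = node.ucb_child()
        # 2. Expansion: try one untried action, if any remain.
        if node.untried:
            action = node.untried.pop()
            child = Node(node.scene | {action}, parent=node)
            node.children[action] = child
            node = child
        # 3. Simulation: randomly complete the partial scene.
        reward = rollout(node.scene)
        # 4. Backpropagation: update the mean reward up to the root.
        while node is not None:
            node.visits += 1
            node.value += (reward - node.value) / node.visits
            node = node.parent
    # Read out the most-visited path as the (possibly partial) final scene.
    node, scene = root, set()
    while node.children:
        action, node = max(node.children.items(), key=lambda kv: kv[1].visits)
        scene.add(action)
    return sorted(scene)

print("objects placed in slots:", mcts())
```

Scaled up, this same select-expand-simulate-backpropagate loop is what allows the search to build on promising partial scenes rather than sampling each scene from scratch.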
In one experiment, the system added the maximum number of objects to a simple restaurant scene, featuring as many as 34 items on a table including massive stacks of dim sum dishes, after training on scenes with only 17 objects on average.
The tool accurately followed users’ prompts at rates of 98 per cent when building scenes of pantry shelves and 86 per cent for messy breakfast tables. Both results represent at least a 10 per cent improvement over comparable methods such as MiDiffusion and DiffuScene.
Users can prompt the system directly by typing specific visual descriptions, or request it to complete specific scenes by filling in empty spaces whilst preserving the rest of an environment. The system can also generate diverse training scenarios via reinforcement learning, teaching a diffusion model to fulfil an objective through trial and error.
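The reinforcement-learning mode is described only at a high level. As a loose sketch of “trial and error toward an objective”, the toy below runs a REINFORCE-style update in which a simple categorical distribution, standing in for the diffusion model, learns to favour scene sizes that score well under a made-up clutter reward. The reward, the distribution and the learning rate are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
MAX_OBJECTS, CAPACITY = 10, 6      # toy table that can physically hold 6 items

def reward(n_objects):
    """Made-up objective: more objects is better, but exceeding the table's
    physical capacity (items would overlap) collapses the reward to zero."""
    return float(n_objects) if n_objects <= CAPACITY else 0.0

# Categorical "generator" over how many objects to place, trained by REINFORCE.
logits = np.zeros(MAX_OBJECTS + 1)
baseline, lr = 0.0, 0.05

for step in range(3000):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    n = rng.choice(MAX_OBJECTS + 1, p=probs)       # trial: sample a scene size
    r = reward(n)                                  # error signal from the objective
    baseline += 0.05 * (r - baseline)              # running-average baseline
    grad = -probs
    grad[n] += 1.0                                 # gradient of log pi(n) w.r.t. logits
    logits += lr * (r - baseline) * grad           # policy-gradient update

probs = np.exp(logits - logits.max()); probs /= probs.sum()
print("most likely object count after training:", int(np.argmax(probs)))
```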
Pfaff noted: “A key insight from our findings is that it’s OK for the scenes we pre-trained on to not exactly resemble the scenes that we actually want.” Using the steering methods, the team can move beyond that broad distribution to generate the diverse, realistic and task-aligned scenes actually needed for robot training.
The researchers acknowledge their work remains a proof of concept. Future plans include using generative AI to create entirely new objects and scenes instead of using a fixed library of assets, and incorporating articulated objects that robots could open or twist.
Russ Tedrake, Toyota Professor of Electrical Engineering and Computer Science, Aeronautics and Astronautics, and Mechanical Engineering at MIT, served as senior author. The work was supported by Amazon and the Toyota Research Institute, with findings presented at the Conference on Robot Learning in September.