The days of creepy, lifeless robotic stares may be numbered, thanks to a machine that learns to smile and sing the way a human baby does: by making faces at itself in a mirror.
Columbia University engineers have unveiled a breakthrough in android design that could finally help robots cross the “Uncanny Valley”, the unsettling feeling humans get when a machine looks almost, but not quite, alive.
In a study published in Science Robotics, the team presented a robot that taught itself to lip-sync by observing its own reflection and watching hours of YouTube videos, rather than by following pre-programmed code.
The result is a machine capable of articulating words in multiple languages and even singing songs from its own AI-generated debut album, hello world_.
The end of ‘Muppet mouth’
For decades, even the most advanced humanoids have suffered from a fatal flaw: they might walk like us, but they talk like Muppets. Their mouths simply open and close in a rhythmic flapping motion that fails to match the complexity of human speech.
“We humans attribute outsized importance to facial gestures in general, and to lip motion in particular,” said Hod Lipson, a professor of innovation at Columbia and director of the Creative Machines Lab.
“While we may forgive a funny walking gait or an awkward hand motion, we remain unforgiving of even the slightest facial malgesture… Robots oftentimes look lifeless, even creepy, because their lips don’t move. But that is about to change.”
The mirror test
The secret to the robot’s success is “observational learning.” Instead of engineers writing thousands of lines of code to dictate every lip twitch, the robot was forced to figure it out for itself.
The process began with a mirror. Equipped with a soft, flexible silicone skin and 26 internal facial motors, the robot spent hours watching its own reflection. Like an infant making faces to test its muscles, it performed thousands of random expressions to understand how its hardware worked.
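For readers curious what “learning by making faces” looks like computationally, here is a minimal sketch, in PyTorch, of one plausible version of this self-modeling stage. The 26-motor count comes from the study; everything else, including the landmark-based face representation, the network sizes, and the `robot.set_motors` and `camera.detect_landmarks` interfaces, is a hypothetical stand-in rather than the team’s actual code.

```python
import torch
from torch import nn

NUM_MOTORS = 26        # reported in the study
LANDMARK_DIM = 68 * 2  # hypothetical: 68 face landmarks, (x, y) each

class ForwardSelfModel(nn.Module):
    """Predicts where the face's landmarks end up for a given motor command."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_MOTORS, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, LANDMARK_DIM),
        )

    def forward(self, motor_cmd):
        return self.net(motor_cmd)

def babble(robot, camera, steps=10_000):
    """Make thousands of random faces in the mirror and record the results.

    `robot` and `camera` are hypothetical interfaces standing in for the
    real hardware and its landmark-tracking vision pipeline.
    """
    commands, faces = [], []
    for _ in range(steps):
        cmd = torch.rand(NUM_MOTORS) * 2 - 1      # random expression in [-1, 1]
        robot.set_motors(cmd)                     # hypothetical actuator call
        faces.append(camera.detect_landmarks())   # hypothetical vision call
        commands.append(cmd)
    return torch.stack(commands), torch.stack(faces)

def train_self_model(commands, faces, epochs=500, lr=1e-3):
    """Fit the forward model: motor command -> observed landmark positions."""
    model = ForwardSelfModel()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(commands), faces)
        loss.backward()
        opt.step()
    return model
```

The key point is that no engineer specifies what any motor does; the mapping from commands to facial shapes is discovered entirely from the mirror data.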
Once it built a mental model of its own face, the robot watched videos of humans talking and singing. By cross-referencing the human movements with its own self-knowledge, it learned to translate audio signals directly into complex lip movements.
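Continuing the sketch above, here is one plausible way such a self-model could be put to work. A separate audio model (not shown, and hypothetical) would predict target lip landmarks frame by frame from the soundtrack; the robot could then recover a motor command for each frame by gradient descent through its frozen self-model. This illustrates the idea, not the team’s published method.

```python
import torch

def motors_for_target(self_model, target_landmarks, steps=200, lr=0.05):
    """Invert the frozen self-model: search for the motor command whose
    predicted face best matches a target lip shape (e.g. one produced by
    a separate, hypothetical audio-to-landmark network)."""
    cmd = torch.zeros(26, requires_grad=True)  # start from a neutral face
    opt = torch.optim.Adam([cmd], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(self_model(cmd), target_landmarks)
        loss.backward()
        opt.step()
        with torch.no_grad():
            cmd.clamp_(-1.0, 1.0)              # keep commands in actuator range
    return cmd.detach()
```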
Crossing the valley
The team tested their creation on a variety of complex tasks. Without being told the words’ meanings, the robot successfully synced its lips to clips of speech in different languages and to fast-paced songs.
However, the technology is not yet perfect. The researchers acknowledged that the robot still struggles with “hard” plosive sounds like ‘B’ and with phonemes that require lip puckering, such as ‘W’.
“But these abilities will likely improve with time and practice,” Lipson promised. “The more it interacts with humans, the better it will get.”
The missing link
The implications extend far beyond aesthetics. Yuhang Hu, the PhD student who led the study, argues that realistic facial affect is the “missing link” in robotics, essential for building trust in sectors like elder care and education.
“When the lip sync ability is combined with conversational AI such as ChatGPT or Gemini, the effect adds a whole new depth to the connection the robot forms with the human,” Hu explained.
With economists predicting that more than a billion humanoid robots could be manufactured over the coming decade, the need for a friendly, non-threatening face is becoming an economic imperative.
“There is no future where all these humanoid robots don’t have a face,” Lipson said. “I’m a jaded roboticist, but I can’t help but smile back at a robot that spontaneously smiles at me.”