Picture yourself in a crowded, deafening room: somehow, you effortlessly tune out the roaring background noise and focus entirely on the friend standing right in front of you. Neuroscientists have puzzled over this remarkable feat of auditory focus, known as the “cocktail party problem,” for decades.
Now, researchers at the Massachusetts Institute of Technology (MIT) have finally cracked the code, mapping exactly how the human brain isolates a single voice from a chaotic cacophony of sound.
In a study published in the journal Nature Human Behaviour, the MIT team revealed that the brain solves this problem by amplifying the activity of neural processing units that respond to specific features of a target voice, such as its pitch.
“That simple motif is enough to cause much of the phenotype of human auditory attention to emerge, and the model ends up reproducing a very wide range of human attentional behaviours for sound,” said Josh McDermott, a professor of brain and cognitive sciences at MIT and the study’s senior author.
The power of “multiplicative gains”
Neuroscientists have long observed that when animals or humans focus on a specific sound, neurons tuned to that sound’s features amplify their activity. This scaling of a neuron’s firing rate is known as a “multiplicative gain.” Conversely, neurons that are not tuned to the target feature show a reduction in activity.
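To make the idea concrete, here is a minimal Python sketch of a multiplicative gain applied to a toy population of pitch-tuned units. The unit tunings, firing rates, and gain values are illustrative assumptions, not numbers from the study.

```python
import numpy as np

# Toy population of pitch-tuned units; all numbers are illustrative,
# not taken from the study.
preferred_pitch = np.array([100.0, 150.0, 200.0, 250.0, 300.0])  # preferred pitch in Hz
baseline_rates = np.array([5.0, 8.0, 12.0, 9.0, 6.0])            # firing rates without attention

def attend_to_pitch(target_pitch, boost=2.0, suppress=0.5, bandwidth=50.0):
    """Multiplicative gain: units tuned near the target pitch are scaled up,
    units tuned away from it are scaled down."""
    near_target = np.abs(preferred_pitch - target_pitch) <= bandwidth
    gains = np.where(near_target, boost, suppress)
    return gains * baseline_rates

print(attend_to_pitch(150.0))
# Units preferring roughly 100-200 Hz are amplified; the rest are attenuated.
```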
However, until now it was unclear whether this biological scaling effect was sufficient on its own to explain how we selectively attend to one voice over another.
To answer this, the researchers — including lead author Ian Griffith, a graduate student in the Harvard Program in Speech and Hearing Biosciences and Technology, and MIT graduate student R. Preston Hess — built a computational model of the auditory system. Previous computational models failed at these types of attentional tasks because they could not pick one voice out of many competing stimuli.
The MIT team modified an existing neural network so that each of its stages can apply multiplicative gains. Under this architecture, the activation of processing units is scaled up or down depending on the specific features they represent.
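Schematically, one might picture each stage of such a network as an ordinary layer whose unit activations are multiplied elementwise by an attention-dependent gain vector before being passed on. The sketch below illustrates that motif under simple assumptions; it is not a reconstruction of the team’s actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

class GainModulatedStage:
    """One stage of a feedforward network whose unit activations are multiplied
    elementwise by an attention-dependent gain vector. A schematic sketch only;
    the study's model is considerably more elaborate."""
    def __init__(self, n_in, n_out):
        self.weights = rng.standard_normal((n_out, n_in)) * 0.1
        self.gains = np.ones(n_out)  # neutral gains: no attention applied

    def set_gains(self, gains):
        self.gains = np.asarray(gains, dtype=float)

    def forward(self, x):
        activation = np.maximum(self.weights @ x, 0.0)  # rectified units
        return self.gains * activation                  # multiplicative modulation

# Stack a few stages; each one can be modulated independently.
stages = [GainModulatedStage(16, 16) for _ in range(3)]
x = rng.standard_normal(16)
for stage in stages:
    x = stage.forward(x)
```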
Mirroring human mistakes
During testing, the researchers first fed the model an audio “cue” of the target voice. The cue’s features determined the multiplicative gains for the next stimulus.
“Imagine the cue is an excerpt of a voice that has a low pitch,” Griffith explained. “Then, the units in the model that represent low pitch would get multiplied by a large gain, whereas the units that represent high pitch would get attenuated.”
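In code, that cue-to-gain mapping might look something like the sketch below, where a hypothetical gains_from_cue function boosts units whose preferred pitch matches the cue and attenuates the rest. The function, its parameters, and the tuning values are illustrative assumptions, not details from the study.

```python
import numpy as np

# Hypothetical pitch-tuned units (preferred pitch in Hz); values are illustrative.
preferred_pitch = np.linspace(80.0, 400.0, 9)

def gains_from_cue(cue_pitch, sharpness=75.0, floor=0.2, peak=2.5):
    """Map a cue's estimated pitch to one gain per unit: units whose preferred
    pitch matches the cue are boosted, dissimilar units are attenuated."""
    similarity = np.exp(-((preferred_pitch - cue_pitch) ** 2) / (2 * sharpness ** 2))
    return floor + (peak - floor) * similarity

low_voice_gains = gains_from_cue(110.0)   # a low-pitched cue boosts low-pitch units
high_voice_gains = gains_from_cue(300.0)  # a high-pitched cue boosts high-pitch units
```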
The model was then fed a complex mix of voices and asked to identify a specific word said by the target voice. The computational model performed remarkably like a human listener across a wide range of conditions. It even made the exact same mistakes as humans, such as struggling to separate two female voices or two male voices due to their similar pitches.
An engine for discovery
The researchers also used the model to explore how spatial location impacts auditory attention. Testing every possible combination of target and distractor locations — a task that would be far too time-consuming with human subjects — the model revealed a completely new property of human spatial attention.
The model successfully isolated target voices when the distractor was separated horizontally, but it struggled significantly when the sounds were separated vertically. When the researchers ran a subsequent experiment with real human subjects, they observed the exact same limitation.
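As a rough illustration of that kind of exhaustive sweep, the sketch below loops over every pairing of hypothetical target and distractor positions. The position grid and the run_trial placeholder are assumptions for illustration, not details from the study.

```python
import itertools

# Hypothetical grid of source positions (azimuth, elevation in degrees);
# the study's actual spatial sampling is not specified here.
azimuths = range(-90, 91, 30)
elevations = range(-30, 61, 30)
positions = list(itertools.product(azimuths, elevations))

def run_trial(target_pos, distractor_pos):
    """Placeholder: render the two voices at these positions, feed the
    mixture to the model, and score word recognition for the target."""
    ...

# Every target/distractor pairing: an exhaustive sweep that would be
# impractical to run with human listeners.
results = {}
for target_pos, distractor_pos in itertools.product(positions, positions):
    if target_pos == distractor_pos:
        continue
    results[(target_pos, distractor_pos)] = run_trial(target_pos, distractor_pos)
```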
Moving forward, the team hopes to apply this computational model to simulate listening through a cochlear implant. By understanding these fundamental attentional mechanisms, researchers aim to improve cochlear implants so that users can finally focus on specific voices in noisy, crowded environments.