Extracting Audio from Visual Information
This month a group of researchers based at MIT successfully demonstrated an algorithm that can recover audio from video footage alone. By analysing the tiny vibrations that nearby sound waves induce in everyday objects, the team was able to reconstruct the sounds that caused them.
“When sound hits an object, it causes the object to vibrate. The motion of this vibration creates a very subtle visual signal that’s usually invisible to the naked eye. People didn’t realize that this information was there.”
– Abe Davis, Graduate Student at MIT
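The team's actual algorithm measures these sub-pixel motions using local phase variations in a complex steerable pyramid, which is well beyond the scope of a blog post. As a rough, much simplified sketch of the same idea, though, the Python snippet below tracks the global sub-pixel shift of each frame against a reference frame using OpenCV's phase correlation and treats that motion signal as audio. The file name and frame rate are illustrative assumptions, not details from the paper.

```python
# Simplified sketch of the "visual microphone" idea, NOT the authors' exact
# method: the MIT team uses phase-based motion analysis in a complex steerable
# pyramid, whereas this stand-in uses plain phase correlation per frame.
# Assumes a high-speed clip of a vibrating object (file name is hypothetical).
import cv2
import numpy as np
from scipy.io import wavfile

VIDEO_PATH = "chip_bag.avi"   # hypothetical input clip
FPS = 2200                    # assumed capture rate of the high-speed camera

cap = cv2.VideoCapture(VIDEO_PATH)
ok, ref = cap.read()
assert ok, "could not read video"
ref = cv2.cvtColor(ref, cv2.COLOR_BGR2GRAY).astype(np.float32)

signal = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
    # Sub-pixel global shift between this frame and the reference frame;
    # the tiny vertical component tracks the object's vibration.
    (dx, dy), _response = cv2.phaseCorrelate(ref, gray)
    signal.append(dy)
cap.release()

audio = np.asarray(signal)
audio -= audio.mean()                     # remove the DC offset
audio /= max(np.abs(audio).max(), 1e-12)  # normalise to [-1, 1]
# One motion sample per frame, so the audio sample rate equals the frame rate.
wavfile.write("recovered.wav", FPS, audio.astype(np.float32))
```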
In a series of incredible experiments, the team was able to recover intelligible speech from video clips of a packet of crisps, aluminium foil, the surface of a glass of water and the leaves of a plant, in one case from 15 feet away and through soundproof glass.
Although many of the experiments used high-speed cameras (2,000 to 6,000 frames per second) to sample the scene at a high enough rate to decipher speech, some used an ordinary digital camera shooting at 60 frames per second, comparable to a typical smartphone. Because each row of such a camera's sensor is exposed at a slightly different instant (its "rolling shutter"), the researchers could still extract vibration information well above the nominal frame rate. Although the recovered audio was not as clear, it was still possible to identify the number of speakers in the room and their genders, and, given enough prior information about the acoustic properties of a speaker's voice, even their identity.

Given the rate at which video imaging in consumer devices is improving, the future applications of this technology in law enforcement and security are not difficult to imagine, and they are not far away.
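To see why those frame rates matter: sampling a vibration once per frame caps the recoverable bandwidth at the Nyquist limit of half the frame rate, which the short back-of-the-envelope below works out. The 1080-row sensor height used for the rolling-shutter estimate is an assumption for illustration.

```python
# Back-of-the-envelope check on why frame rate matters: a camera sampling a
# vibration once per frame can only recover audio up to the Nyquist limit of
# half the frame rate.
def nyquist_hz(fps: float) -> float:
    """Highest audio frequency recoverable when sampling once per frame."""
    return fps / 2.0

for fps in (60, 2_000, 6_000):
    print(f"{fps:>5} fps -> audio up to {nyquist_hz(fps):>6.0f} Hz")

# A 60 fps camera nominally tops out at 30 Hz, far below speech. The rolling
# shutter sidesteps this: each row of a frame is exposed at a slightly
# different instant, so the effective sampling rate is roughly rows x fps
# (ignoring dead time between frames; 1080 rows is an assumed sensor height).
rows_per_frame = 1080
print(f"rolling shutter: ~{60 * rows_per_frame} samples/s from a 60 fps camera")
```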
“This is totally out of some Hollywood thriller. You know that the killer has admitted his guilt because there’s surveillance footage of his potato chip bag vibrating.”
– Alexei Efros, Associate Professor at UC Berkeley
For further information and a video of the technique in action, see: http://newsoffice.mit.edu/2014/algorithm-recovers-speech-from-vibrations-0804