Paul May

More Audio Feature Extraction

07 August 2016

I’m making incremental progress on my listening robot project. The ultimate goal is to build a little machine capable of listening to its surroundings, learning from the sound it hears, and then - later - being able to determine its location based upon what it hears. A reasonable prototype robot might be able to remember 5-10 locations, and then tell those locations apart when asked to do so.

There are a number of ways of approaching the software part of the project. I’ve chosen to go down a classical, supervised machine learning route - extracting features from sound, then using them to fit a machine learning model. In essence, this means taking raw sound, and noticing particular qualities of the sound. These qualities are associated with a label - say the name of the location - and used to train a machine learning model.

My work over the last week has been to implement simple feature extraction techniques demonstrated in the off-the-shelf audio analysis libraries Librosa and pyAudioAnalysis. I have been passing short snippets of music into their feature extraction functions to output specific features of that sound, such as the spectral centroid (a measure of the "brightness" of a sound) and the spectral contrast (a measure of the difference between spectral peaks and valleys in the sound).
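A minimal sketch of the kind of Librosa calls involved - the file path and snippet length here are placeholders, not my actual data:

```python
import librosa

# Load a short snippet of audio; the path and duration are placeholders.
signal, sample_rate = librosa.load("snippets/example.wav", duration=10.0)

# Spectral centroid: a rough per-frame measure of the "brightness" of the sound.
centroid = librosa.feature.spectral_centroid(y=signal, sr=sample_rate)

# Spectral contrast: the per-frame difference between spectral peaks and
# valleys, computed for each frequency sub-band.
contrast = librosa.feature.spectral_contrast(y=signal, sr=sample_rate)

print(centroid.shape)  # (1, n_frames)
print(contrast.shape)  # (n_bands + 1, n_frames)
```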

In software, I’m implementing feature extractor classes that can be passed some sound and return feature representations of it. I’m trying to get the structure right up front, so I don’t have a lot of refactoring and noodling to do later. I’m also trying to stay somewhat agnostic of the underlying library, so that libraries can be mixed or swapped as needed.
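Roughly the shape I'm aiming for - the class and method names below are illustrative, not the actual code:

```python
import numpy as np
import librosa


class FeatureExtractor:
    """Common interface: take a signal, return a fixed-length feature vector."""

    def extract(self, signal, sample_rate):
        raise NotImplementedError


class SpectralCentroidExtractor(FeatureExtractor):
    """Wraps Librosa's spectral centroid behind the common interface."""

    def extract(self, signal, sample_rate):
        centroid = librosa.feature.spectral_centroid(y=signal, sr=sample_rate)
        # Summarise the per-frame values so every extractor returns a
        # fixed-length vector, whatever library sits underneath.
        return np.array([centroid.mean(), centroid.std()])


class SpectralContrastExtractor(FeatureExtractor):
    """Wraps Librosa's spectral contrast behind the same interface."""

    def extract(self, signal, sample_rate):
        contrast = librosa.feature.spectral_contrast(y=signal, sr=sample_rate)
        return np.concatenate([contrast.mean(axis=1), contrast.std(axis=1)])
```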

I’m getting close to just piping feature data into some dumb classifier - a crude proof of concept. I’m not sure whether feature extraction techniques geared towards understanding music are useful for understanding ambient sound; if they’re not, then I may need to adopt another approach.
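Something like the following is what I have in mind for that proof of concept - concatenate the output of each extractor and hand it to a simple scikit-learn classifier. The location labels, file paths, and choice of classifier are placeholders, and the extractor classes are the illustrative ones sketched above:

```python
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

extractors = [SpectralCentroidExtractor(), SpectralContrastExtractor()]


def featurise(path):
    """Load a snippet and concatenate the output of each extractor."""
    signal, sample_rate = librosa.load(path)
    return np.concatenate([e.extract(signal, sample_rate) for e in extractors])


# Hypothetical training snippets, each labelled with the location it was recorded in.
training = [("clips/kitchen_01.wav", "kitchen"),
            ("clips/street_01.wav", "street"),
            ("clips/office_01.wav", "office")]

X = np.array([featurise(path) for path, _ in training])
y = [label for _, label in training]

classifier = KNeighborsClassifier(n_neighbors=1)
classifier.fit(X, y)

# Later: guess the location of a new, unlabelled snippet.
print(classifier.predict([featurise("clips/mystery.wav")]))
```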

Related Images

Rihanna track - The spectral centroid fluctuates in line with the beat, peaking on the snare drum/white noise
Rihanna track - The spectral contrast of the track. I'm not sure if this is a useful representation of the sound - more reading to do
Mozart - The spectral centroid is much more even and consistent than in the Rihanna track. The brightness of the sound tracks the notes played on the piano, peaking on the highest and brightest note.
Mozart - The spectral contrast of the track. Clearly different from the Rihanna track, but a useful way to represent the sound? I'm not so sure.
Paul May is a researcher, interaction designer, and technologist from Dublin, Ireland. He is currently working with Memorial Sloan Kettering Cancer Center on smart health applications.