Paul May

Extracting Information from Sound

31 July 2016

I’m working on a new information retrieval and machine learning project - but unlike previous projects that involved large amounts of text, this project involves sound.

My goal is to create a small, self-contained robot that can listen to, and learn from, its surroundings - building up sonic fingerprints of a number of locations. Later, it should be possible for the robot to tell what location it's at simply by listening, and comparing what it hears to previous experiences.

I’m just getting started with the project, and my focus is on writing simple software that can record sound, extract features, fit machine learning models, then classify previously unheard sound.
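To make the pipeline concrete, here's a minimal sketch of the "fit, then classify" step. Everything in it is an assumption: the post doesn't say what model or features will be used, so this stands in with a toy nearest-centroid classifier over fixed-length feature vectors, using only numpy.

```python
import numpy as np

class SoundClassifier:
    """Toy nearest-centroid classifier over fixed-length feature vectors.
    A stand-in for whatever model the project eventually settles on."""

    def __init__(self):
        self.centroids = {}

    def fit(self, features, labels):
        features = np.asarray(features, dtype=float)
        for label in set(labels):
            rows = features[[i for i, l in enumerate(labels) if l == label]]
            # The "sonic fingerprint" of a location: the mean feature vector
            self.centroids[label] = rows.mean(axis=0)

    def predict(self, feature):
        feature = np.asarray(feature, dtype=float)
        # Pick the location whose fingerprint is closest in feature space
        return min(self.centroids,
                   key=lambda lbl: np.linalg.norm(feature - self.centroids[lbl]))

# Toy fingerprints for two imagined locations
clf = SoundClassifier()
clf.fit([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]],
        ["kitchen", "kitchen", "street", "street"])
print(clf.predict([0.95, 0.15]))  # → kitchen
```

The interesting questions all live upstream of a model like this: what the feature vectors actually are, and whether they separate locations at all.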

The field of audio/music information retrieval is very well-developed, so I have a lot to read and learn from.

As a simple hello-world, I took two existing sound files - a Mozart piano sonata and a recent track by Rihanna - and visualized them as chromagrams. Visually, the difference between the two tracks is clear, and this might be a clue that the notes used in a piece of music represent useful features for a machine learning model. I've yet to discover if this is the case, or if chromatic features are useful for more ambient sound.
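For readers unfamiliar with the technique: a chromagram folds a short-time spectrum onto the 12 pitch classes. The post doesn't say which tool produced its figures, so the following is just a bare-bones illustration in numpy - map each FFT bin's frequency to a pitch class relative to A440 and sum the magnitudes - not a substitute for a proper MIR library.

```python
import numpy as np

def chromagram(signal, sr, n_fft=2048, hop=512):
    """Fold an STFT magnitude spectrum onto the 12 pitch classes."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    # Pitch-class index (0 = A) for each frequency bin; bin 0 (DC) is skipped
    pitch_class = np.zeros(len(freqs), dtype=int)
    pitch_class[1:] = np.round(12 * np.log2(freqs[1:] / 440.0)).astype(int) % 12
    chroma = np.zeros((12, n_frames))
    for t in range(n_frames):
        frame = signal[t * hop : t * hop + n_fft] * window
        mag = np.abs(np.fft.rfft(frame))
        mag[0] = 0.0  # ignore DC
        for pc in range(12):
            chroma[pc, t] = mag[pitch_class == pc].sum()
    return chroma

# Sanity check: a pure 440 Hz tone should light up the "A" row
sr = 22050
tone = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
c = chromagram(tone, sr)
print(c.sum(axis=1).argmax())  # → 0, i.e. pitch class A
```

Each column of the result is one time step; plotting the 12 × frames matrix as an image gives the kind of figure described below.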

It's worth noting that, for now, I'm forgoing any thought of using deep learning techniques to create vector representations of sound for use in classification tasks. I might try to tackle this down the road. Crawl, walk, run, etc.

I’ll write up what I find as the project continues.

Related Images

The chromagram of a Mozart piano sonata. The chromagram is a time series (time runs left to right on the x-axis) that breaks sound down into the 12 notes of the well-tempered western scale - the 12 notes you find in an octave on a piano. Visually, it's possible to pick out the more common notes in the sonata.
The chromagram of a recent Rihanna track. The track is longer, and hence more dense on the x-axis, but it's possible to tell that the sound is also much denser chromatically, on the y-axis; there are more notes being played at any given time - more polyphony, more chords, more instruments. The more common notes look very different from the Mozart sonata.
The Mozart sonata trimmed to 1 minute. Notice the section at around 30s where the repeating structure of earlier bars breaks down, and other tones come into play.
The Rihanna track trimmed to 1m. Like the Mozart track, there's a point just after 30s where the repeating structure breaks down - this isn't a bridge, but a number of instruments dropping out, before coming back in again. The track is incredibly repetitive, pretty much from start to finish, save for this patch and the last couple of bars.
  • Paul May is a researcher, interaction designer, and technologist from Dublin, Ireland. He is currently working with Memorial Sloan Kettering Cancer Center on smart health applications.