Audio File Matching Program

Question

4.20/5 (3 votes)

See more:

I'm trying to write a program than can take two audio files (e.g. WAV) as inputs, compare them, and spit out a number that tells you how similar the audio files are.

If someone has done something like this, know how to go about doing it, or just have some ideas, please let me know. Anything will be greatly appreciated.

Specific questions: What language is suitable? How hard is it to do (how many
hours, roughly)? Where can I find a good source of audio library/tools?

Thanks!

Posted 21-Jan-11 1:05am

Member 3107204

Add a Solution

Comments

Sergey Alexandrovich Kryukov 21-Jan-11 14:56pm

Excellent question, would make a great project, very complex.
None of the answers approach the goal -- no wonder.
--SA

5 solutions

Solution 5

One approach could be to create an audio fragment recognition/matching system based on Hidden Markov Models[^] (HMM), where the audio is considered a piecewise signal.

This is an elaboration on SAKryukov's answer, that I think may have some relevance to 3), 4) and 5)

Update:
As with most things - it turns out that somebody wrestled with part of the problem before:
Greed is Good: Algorithmic Results for Sparse Approximation[^], allows you to retrieve a sparse signal, an optimized minimal representation of the input signal. I imagine this could be used to process the inputs to the system, generating better "pieces" for the "dictionary" that could serve as the foundation for the dynamically generated HMM.

This approach can possibly be used to recognize sequences of pieces from the first stream in the second stream.

There is a conceptual similarity to a bottom up parser where one stream is used to construct the "grammar" - please note the "'s, and you'll have to deal with probable matches defining multiple possible solution paths, and then use that information to determine if you have something you want to consider a match or not. I envision some kind of state machine/automata where the branches are evaluated according to how probable the next state is. This state machine is generated based on what can be “learned” from one of the streams and used during matching against the other stream.

HMMs have found their use in speech recognition systems, so maybe they can be used to implement what you are trying to achieve - the hard part will be to formulate one of the input streams as a HMM - "hard" as in I don't think this has been tried before, but other uses of HMM indicates that this may be a feasible approach.

To get the coefficient vectors for the HMMs, calculate the Fourier transform for a piece of the signal, as proposed by SAKryukov, and then select the most significant coefficients - those that contribute the most the the "shape" of the signal as input to the HMM.

This is so oversimplified, I'm just trying to point you in the direction of a possible solution.

It will probably be quite hard to automatically determine the statistical significance of the decomposed input to the HMM, but I guess a something can be learned from research on "part of speech" recognition systems, including how to "train" them.

This is just my 2ct's worth on one possible approach to the development of SAKryukovs correlation method. This is as SAKryukov says, research material :)

What language is suitable? c/c++, you'll find some high quality math/numeric libraries around the net.

How hard is it to do (how many hours, roughly)? Starting from scratch? Well, maybe a couple of years for some really dedicated scientists and developers.

Where can I find a good source of audio library/tools? sourceforge.net gives you a lot of options - but as far as I know, none that will provide more than the "trivial" parts of your solution.

Regards
Espen Harlinn

Posted 23-Jan-11 4:36am

Espen Harlinn

Updated 23-Jan-11 7:43am

v5

Comments

Sergey Alexandrovich Kryukov 23-Jan-11 12:36pm

Espen, I think you're much more qualified to this matter than I am (I also remember your adding note about wavelets; wavelets might be more relevant to this problem rather them relatively simple oscilloscope application).
My respect.

As to the expected success of the development, this is questionable. Main problem is not even technology but some kind of science (in particular, in human recognition): what criteria of closeness should match intuitive human, what criteria make sense for different application, how humans can be classified into different types by their recognition. I have learned a wonderful lesson on how much different people are, but this is quite a different story.

(5 of course, who would doubt...)
--SA

Espen Harlinn 23-Jan-11 13:11pm

Thanks SAKryukov - remember I've taken a look at your resume, I imagine you tackled some pretty complicated scenarios at http://www.inr.ac.ru/ - and other places :) - once upon a time I imagined that doing signal processing and related hardware would be what I would do for a living, turns out I was wrong – but I do remember a few things … Given that anything we present on this issue is a vast oversimplification - the paper http://www.math.princeton.edu/tfbb/spring03/greed-ticam0304.pdf, seems to justify part of the approach you outlined - but obviously in more detail. ( The parts the paper doesn't justify aren't dealt with at all )

Sergey Alexandrovich Kryukov 23-Jan-11 19:01pm

Great, Espen, thanks you for this reference.
I am a physicist but most of the time spent on software architecture and implementation and practically all related stuff.

Solution 6

I found a good article regarding your question
Duplicate songs detector via audio fingerprinting[^]

Posted 28-Sep-11 1:46am

avlaw

Add a Solution

Add your solution here

Treat my content as plain text, not as HTML

Preview 0

…

Existing Members

Sign in to your account

...or Join us

Download, Vote, Comment, Publish.

Your Email
Password
Forgot your password?

Your Email
This email is in use. Do you need your password?
Optional Password

I have read and agree to the Terms of Service and Privacy Policy
Please subscribe me to the CodeProject newsletters

When answering a question please:

Read the question carefully.
Understand that English isn't everyone's first language so be lenient of bad spelling and grammar.
If a question is poorly phrased then either ask for clarification, ignore it, or edit the question and fix the problem. Insults are not welcome.
Don't tell someone to read the manual. Chances are they have and don't get it. Provide an answer or move on to the next question.

Let's work to help developers, not make them feel stupid.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Andrew Brock · Accepted Answer · 2011-01-21T01:45:00

Since you have listed your platform as iPhone, you have the exciting choice of C, C++ and Objective C programming languages, which are mostly the same thing and can be mixed and matched pretty easily.

This is good, because most open source projects are written in C/C++.

The basic method would be what you'd expect:
Decode music data (WAV is pretty much raw data enclosed in headers)
Analyse the raw data.

I don't know of any decoders for the iPhone, but I read somewhere that LAME[^] works, but is not optimised for it.

As for the analysis, it depends on how dynamic you want it to be.
A binary compare would be a bad choice, if you compare 2 songs the same, 1 starts just 0.1s later than the other then they will hardly match.
A better method would be something like beat mapping (at least as a first pass) to find interesting points in the song and try to match them together.

EDIT: I only provided this answer because no one else has yet. I'm no professional in the field and there may be better ways about doing this.

Emilio Garavaglia · Accepted Answer · 2011-01-21T02:16:00

Depends on what do you mean with "similar".
A binary compare (even with the only raw data) doesn't work, because a same music played in different executions is not necessarily made by the same identical "waves".

A method that can give you some success is try to guess suitable periods (a low pass filter can help you to identify a "beat") then, for each period estimate the spectrum, find the correlation function between the homologous spectrum and find how much it is different form an "impulse" function.

That's not "easy math" (google may help you, but lot of theory is needed).

Another empirical method (based on the fact that the "zip" algorithm is based on periodograms)can be the following:
keep "a.wav" and zip it (giving a.zip)
keep "b.wav" and zip it (gicing b.zip)

now zip a.zip and b.zip together.
If the length of the resulting zip is similar to the sum of the included zip nothing can be said.
If the length of the resulting zip is shorter (and similar to the longest of the two composing zip-s) than the two wav files are probably similar.

(The trick is that, if they are similar they will have similar compression patterns, that -when compressed together- will merge, otherwise will remain distinct, and placed one after the other)

[EDIT]
Oh, I forgot to mention that the real problem is the complexity (not only in the sense on numbers ...) of the algorithms.
The programming language, at that point, is just a detail.
(C will probably be the best "number cruncher")
[/EDIT]

Sergey Alexandrovich Kryukov · Accepted Answer · 2011-01-21T09:12:00

Disclaimer: this is not an answer, because it is impossible to give any definitive answer.

My answer is a (very tentative) draft for a whole research program:

1) Develop conversion of all files into identical raw format (see answer by Andrew).
2) Develop Fourier analysis for a arbitrary fragment of audio, see http://en.wikipedia.org/wiki/Fourier_analysis[^], http://en.wikipedia.org/wiki/Fast_Fourier_transform[^].
3) Develop comparison criteria for Fourier images with weights such as length of the fragment.
4) Develop "tokenization" if the whole audio piece: a way to break up all audio stream into several "distinct" fragments; with the requirement that the sub-fragments withing a "token" fragments would be relatively "close" compared to the nearby fragments. Attention! This is most difficult part. Prepare to learn fuzzy sets or genetic programming or something like that used in image recognition.
5) You have tokenized audio fragments. Try to match fragments from different audio streams, score good matches with weights.
6) Compare the scores, present result.
7) Develop optional criteria. One criterion may put more weight on high frequencies, another on tempo/duration, etc.

Even in most successful case, I am not expecting good result for audio performing, for example, same music opus played by different instruments, or, say, different singers.

Even if you get limited success, prepare yourself for major awards in computer science. :)

Now, in real life this is a pretty important problem. I remember announcement on Boston Craig's List. A "patient" offered considerable fee to anyone who would sort tons of his music records, with elimination of duplicates (important!), merging tags, descriptions, etc. He did not care manually or through programming but hoped for development of the technology (pretty naive hope, but...). Now imaging you love music (like I do) -- that would mean a real torture while listening a lot of... well, different records. If you don't care, chances are you would fail to recognize the tunes... Makes sense, right?

[EDIT]

Please pay attention for the Answer by Espen Harlinn: he was able to go much deeper then I did.
See also my comment to that Answer.

Audio File Matching Program

5 solutions

Solution 1

Solution 2

Solution 3

Solution 5

Solution 6

Add your solution here

Preview 0