Disclaimer: this is not an answer, because it is impossible to give any definitive answer.
My answer is a (very tentative) draft for a whole research program:
1) Develop conversion of all files into one identical raw format (see answer by Andrew).
2) Develop Fourier analysis for an arbitrary fragment of audio; see http://en.wikipedia.org/wiki/Fourier_analysis and http://en.wikipedia.org/wiki/Fast_Fourier_transform.
3) Develop comparison criteria for the Fourier images (spectra) of fragments, with weights such as the length of the fragment.
4) Develop "tokenization" of the whole audio piece: a way to break up the whole audio stream into several "distinct" fragments, with the requirement that the sub-fragments within a "token" fragment are relatively "close" to each other compared to the nearby fragments. Attention! This is the most difficult part. Prepare to learn fuzzy sets or genetic programming or something similar to what is used in image recognition.
5) You have tokenized audio fragments. Try to match fragments from different audio streams, score good matches with weights.
6) Compare the scores and present the result.
7) Develop optional criteria. One criterion may put more weight on high frequencies, another on tempo/duration, etc.
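To make steps 2, 3 and 5 a bit more concrete, here is a minimal sketch in Python (standard library only). Everything in it is my assumption, not a proven method: the function names are hypothetical, the transform is a naive O(n^2) discrete Fourier transform standing in for a real FFT, cosine similarity is just one possible comparison criterion, and the weight is simply the fragment length. It assumes the audio is already decoded into lists of raw samples (step 1) and already split into fragments (step 4, the hard part, which is not attempted here).

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Return DFT magnitudes for bins 0..n//2 (naive O(n^2) transform, not an FFT)."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(samples))
        spectrum.append(abs(acc) / n)
    return spectrum


def cosine_similarity(a, b):
    """Similarity of two magnitude spectra: about 1.0 for the same shape, about 0.0 for disjoint ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def weighted_match_score(fragment_pairs):
    """Average spectral similarity over fragment pairs, weighted by fragment length."""
    total = 0.0
    total_weight = 0
    for frag_a, frag_b in fragment_pairs:
        # Truncate both fragments to a common length so the frequency bins line up.
        n = min(len(frag_a), len(frag_b))
        sim = cosine_similarity(magnitude_spectrum(frag_a[:n]),
                                magnitude_spectrum(frag_b[:n]))
        total += n * sim
        total_weight += n
    return total / total_weight if total_weight else 0.0


def tone(freq_hz, n=64, rate=64):
    """Hypothetical test signal: a pure sine wave of n samples at the given sample rate."""
    return [math.sin(2 * math.pi * freq_hz * t / rate) for t in range(n)]


if __name__ == "__main__":
    print(weighted_match_score([(tone(4), tone(4))]))   # same tone: close to 1
    print(weighted_match_score([(tone(4), tone(10))]))  # different tones: close to 0
```

Comparing magnitudes only makes the score insensitive to phase, which is usually what you want for "does it sound the same". An optional criterion from step 7 could be added by multiplying each bin by a frequency-dependent weight before the comparison.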
Even in the most successful case, I would not expect good results for audio that presents, for example, the same musical opus played on different instruments or, say, sung by different singers.
Even if you get limited success, prepare yourself for major awards in computer science. :)
Now, in real life this is a pretty important problem. I remember an announcement on Boston Craig's List. A "patient" offered a considerable fee to anyone who would sort tons of his music records, with elimination of duplicates (important!), merging of tags, descriptions, etc. He did not care whether it was done manually or through programming, but hoped for development of the technology (a pretty naive hope, but...). Now imagine you love music (like I do) -- that would mean real torture while listening to a lot of... well, different records. If you don't care, chances are you would fail to recognize the tunes... Makes sense, right?
[EDIT]
Please pay attention to the Answer by Espen Harlinn: he was able to go much deeper than I did.
See also my comment to that Answer.