Sorry, your questions and comments (please see above) suggest that such a problem is well above your head. You need to pick a much simpler assignment.
Let's start with input that is not MIDI. The input is the digitized (sampled) dependence of the current in the speaker coil on time. In this signal there are no notes per se; there are not even frequencies. In real-life musical samples, the situation is very far from even a set of mixed frequencies. It's a mess of different sounds and noises, and virtually none of them sounds for a period long enough that you could simply analyze the set of frequencies.
Even if the sounds were not digitized and you already had an instrument for perfect spectrum analysis, you would not get a spectrum with a finite set of frequencies and phases. Instead, you would get a continuous spectrum, without discrete frequencies. This follows from the well-known theory of the Fourier transform. Even if you try to create a perfect sine sound, it will have a continuous spectrum with an infinite set of frequencies as soon as you limit it in time. Digitization makes this problem more difficult.
You need to resolve all this mess into a set of frequencies corresponding to the pure musical tones of equal temperament (or any other system). You need to get a picture abstracted from a lot of detail, suppress or ignore the noises, etc. Theoretically, this is not always possible (not every noise can be interpreted as musical). And even when it is possible from the point of view of human perception, it is extremely difficult. This is a very difficult combination of Fourier analysis and image recognition:
http://en.wikipedia.org/wiki/Fourier_analysis,
http://en.wikipedia.org/wiki/Image_recognition#Recognition.
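To make the point about a time-limited sine concrete, here is a minimal sketch in plain Python. It is only an illustration under my own assumptions: the window length N, the 1% threshold and the helper names dft_magnitudes and significant_bins are invented for this example, and the naive transform is far too slow for real audio. It compares a sine that fits the sampled window exactly with one that is cut off mid-cycle.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Naive O(N^2) discrete Fourier transform; magnitudes of bins 0..N/2."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(samples))
        mags.append(abs(acc))
    return mags

def significant_bins(mags, threshold=0.01):
    """Count bins whose magnitude exceeds `threshold` times the peak."""
    peak = max(mags)
    return sum(1 for m in mags if m > threshold * peak)

N = 64  # number of samples in the time-limited window (arbitrary choice)

# Exactly 8 cycles fit the window: all the energy lands in one bin.
aligned = [math.sin(2 * math.pi * 8 * t / N) for t in range(N)]
# 8.5 cycles: the tone is cut off mid-cycle and the energy smears over many bins.
cut_off = [math.sin(2 * math.pi * 8.5 * t / N) for t in range(N)]

aligned_count = significant_bins(dft_magnitudes(aligned))
cut_off_count = significant_bins(dft_magnitudes(cut_off))
print("bins above 1% of peak:", aligned_count, "vs", cut_off_count)
```

The second tone is still a single pure sine, yet the analysis no longer shows a single frequency. Real recordings, where nothing lines up with the window and many sounds overlap, are far worse than this toy case.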
Are you familiar with either of these fields? Each is a whole course of education; each takes much more than reading some articles or even a book. And even if you are educated in these fields, that is not enough to approach this problem. A while ago I tried several pieces of open-source software that attempt to solve this problem and found their quality very poor: they could analyze only very simple recordings, and with a lot of errors. I can imagine that a very high-quality product might exist and work well on many non-trivial samples, but that would really have to be top-notch technology. You don't realize this, not even close, as you keep talking about "requirement" and "A, B, C#…".
The opposite end is MIDI. A MIDI sequence or file has no sounds. It is practically already composed of notes. More exactly, it is a description of a sequence of events. Imagine a description of piano playing. Each MIDI event essentially describes which piano key is pressed or released, at what time, and how loudly it is played. The whole piece combines several instruments playing at the same time and can include more complex detail such as pitch bending (as with a guitar or the wheel of an electronic piano), percussion and more. All you need is knowledge of the MIDI format and the ability to parse the file; it also requires basic knowledge of music theory, just the trivial part of it. This problem is nothing compared to the recognition problem described above.
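To show how little is involved on the MIDI side, here is a rough sketch of decoding note-on and note-off messages into note names. It is not a file parser: a real MIDI file also has header and track chunks, delta times, running status and meta events, none of which are handled here. The function names and the sample event list are made up for illustration, and naming key 60 as C4 is one common convention (some vendors call it C3).

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def note_name(key):
    """Name of MIDI key number 0..127, assuming the convention where key 60 is C4."""
    return f"{NOTE_NAMES[key % 12]}{key // 12 - 1}"

def note_frequency(key, a4=440.0):
    """Equal-temperament frequency in Hz; key 69 is A4."""
    return a4 * 2 ** ((key - 69) / 12)

def describe_channel_message(status, key, velocity):
    """Describe a three-byte note-on (0x9n) or note-off (0x8n) message."""
    kind = status & 0xF0      # upper nibble: message type
    channel = status & 0x0F   # lower nibble: channel 0..15
    if kind == 0x90 and velocity > 0:
        action = "pressed"
    elif kind == 0x80 or (kind == 0x90 and velocity == 0):
        # A note-on with velocity 0 is conventionally treated as a note-off.
        action = "released"
    else:
        return None  # some other message type, ignored in this sketch
    return (action, channel, note_name(key),
            round(note_frequency(key), 2), velocity)

# Hypothetical sample data: (status, key, velocity) triples.
events = [(0x90, 60, 100), (0x90, 69, 80), (0x80, 60, 0), (0x90, 69, 0)]
described = [describe_channel_message(*e) for e in events]
for d in described:
    print(d)
```

Note that nothing here has to be recognized or guessed: the key number is stored in the event itself, and the name and frequency follow from simple arithmetic.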
I have a feeling that I am wasting my time. I only hope some reasonable readers will find this elementary introduction to the problem interesting.
—SA