Thank you for the clarifications (see the comments to the question). Okay, I must say, the assignment is far from simple. I don't really believe in the adequacy of the teacher: if you can give such challenging assignments to high school students, don't teach stupidities like VB; teach some serious programming. (Before answering this question, I looked up information on the Israeli education system to confirm for myself that "high school" really means students in grades 10 to 12, typically 15 to 18 years old. That means people who have not started advanced mathematics and have only learned school algebra and geometry, maybe the basics of calculus, without the level of rigor accepted in this field.)
The problem is quite solvable, but… The main goal of the solution is to make the recognition of the "hardcopy" image more or less reliable. Of course, printing the text itself and, hence, OCR would be out of the question. I would apply something similar to a kind of bar code. A library for one of the standard bar codes could be used, but I have no idea who would write such a library for "VB" or "Matlab". However, for many platforms and technologies one would be easy to find. For .NET, in particular, one could find a lot of solutions.
So, I would try to create a very simplified graphical code. Something like this: 1) draw a thick black rectangular frame to define the scale and aspect ratio of the coded area; 2) subdivide the area inside the frame into several rows, and each row into N areas, each representing a single byte; 3) subdivide the area of each byte into 8 areas representing bits, which gives 8 * N bit areas per row; 4) serialize the text into a series of bytes and then bits, and for each set bit, paint the bit area black.
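To make the encoding step concrete, here is a minimal Python sketch of step 4 under my own assumptions: UTF-8 as the text encoding, most significant bit first, and zero bytes as padding for the last row. None of these are fixed by the scheme above, and the function names are made up for illustration.

```python
def text_to_bit_rows(text, bytes_per_row):
    """Serialize text into rows of bits, 8 * bytes_per_row bits per row.

    A bit value of 1 means "paint this bit area black".
    """
    data = text.encode("utf-8")  # assumption: UTF-8, the scheme names no encoding
    # Assumption: pad the last row with zero bytes so that every row is full.
    padding = (-len(data)) % bytes_per_row
    data += bytes(padding)

    rows = []
    for start in range(0, len(data), bytes_per_row):
        row = []
        for byte in data[start:start + bytes_per_row]:
            # Assumption: most significant bit first.
            row.extend((byte >> shift) & 1 for shift in range(7, -1, -1))
        rows.append(row)
    return rows


def render_ascii(rows):
    """Preview the code as text: '#' for a black bit area, '.' for a white one."""
    return "\n".join("".join("#" if bit else "." for bit in row) for row in rows)


if __name__ == "__main__":
    print(render_ascii(text_to_bit_rows("Hi!", 2)))
```

Drawing the real image is then only a matter of filling one rectangle per `1` bit inside the frame with whatever graphics API is available.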
The recognition should first determine the rectangle of the code area, using the frame. This is the most challenging part. In fact, this part could be skipped. One could assume that the user of this software can take the scanner, align the page very accurately, pre-scan it, manually select the area of the code precisely, and finally scan only the rectangular area where the code is. But a quality algorithm should determine the code area automatically. I think that would be a very advanced requirement and I would not insist on it (after all, the assignment did not demand it strictly). An even more challenging requirement would be handling the cases where the paper is not accurately aligned. Commercial software, to be successful, would certainly need this feature.
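For the simplest automatic case only, a rough sketch could look like the one below. It assumes a well-aligned grayscale scan given as a list of pixel rows (values 0 to 255) and nothing dark on the page outside the frame; it does not handle rotation or noise at all. The name `find_frame_bbox` and the threshold of 128 are my own choices for illustration.

```python
def find_frame_bbox(pixels, threshold=128):
    """Return (left, top, right, bottom) of the dark region, or None.

    pixels is a list of rows of grayscale values (0 = black, 255 = white).
    right and bottom are exclusive. Assumes an axis-aligned scan with no
    dark marks outside the frame.
    """
    if not pixels or not pixels[0]:
        return None
    width = len(pixels[0])
    # Rows and columns that contain at least one dark pixel.
    dark_rows = [y for y, row in enumerate(pixels)
                 if any(value < threshold for value in row)]
    dark_cols = [x for x in range(width)
                 if any(row[x] < threshold for row in pixels)]
    if not dark_rows:
        return None
    return (dark_cols[0], dark_rows[0], dark_cols[-1] + 1, dark_rows[-1] + 1)
```

This gives the outer edge of the frame; the coded area is found by moving inward by the frame thickness, which would have to be part of the code conventions or measured separately.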
Now, as to recognition of the bits: my idea is based on the fact that the location of each rectangular bit area is known in advance from the size of the frame (or the total size of the whole code). The aspect ratio of the bit areas and the number of bytes per row (N, see above) should be used as preexisting knowledge, part of the code conventions. The number of rows (N-byte fragments) should be determined during recognition. One useful option would be to reserve the first row to declare the length of the coded text, as an unsigned byte (or other unsigned integer) representation.
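The geometry described here can be sketched in a few lines of Python. The names are hypothetical, `cell_aspect` is taken to mean height divided by width of one bit area, and big-endian byte order for the length header is my assumption, not part of the convention above.

```python
def grid_layout(inner_width, inner_height, bytes_per_row, cell_aspect):
    """Return (cell_width, cell_height, row_count) for the area inside the frame.

    bytes_per_row (N) and cell_aspect (height / width of one bit area) are
    known conventions; the row count is derived from the measured height.
    """
    cell_width = inner_width / (8 * bytes_per_row)
    cell_height = cell_width * cell_aspect
    row_count = round(inner_height / cell_height)
    return cell_width, cell_height, row_count


def bit_cell_rect(left, top, cell_width, cell_height, row, column):
    """Rectangle (x0, y0, x1, y1) of the bit area at the given row and column.

    left and top are the coordinates of the inner corner of the frame;
    column runs from 0 to 8 * N - 1.
    """
    x0 = left + column * cell_width
    y0 = top + row * cell_height
    return (x0, y0, x0 + cell_width, y0 + cell_height)


def decode_length(header_bytes):
    """Read the text length from the first row (assumption: big-endian unsigned)."""
    return int.from_bytes(header_bytes, "big")
```

For example, an inner area of 160 by 60 pixels with N = 2 and bit areas twice as tall as wide gives 10 by 20 pixel bit areas in 3 rows.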
This way, recognizing each byte does not involve searching for its location. It is enough to determine the color of each bit area. This should never be decided from a single pixel: the solution should take a set of pixels inside the expected bit area, average them, and tell white from black. For example, with a one-byte-per-pixel grayscale bitmap, one should compare the average pixel value with 128. If it is greater, the area is white (with the encoding above, the bit is cleared, 0); otherwise it is black (the bit is set, 1). The opposite convention works just as well, as long as encoding and recognition agree.
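A minimal sketch of this averaging, assuming the scan is available as a list of rows of grayscale values (0 to 255) and that black means a set bit. The `margin` parameter, which samples only the central part of each bit area to stay away from its neighbors, is my own addition, and both function names are made up.

```python
def read_bit(pixels, x0, y0, x1, y1, threshold=128, margin=0.25):
    """Return 1 if the bit area (x0, y0, x1, y1) is black on average, else 0.

    pixels is a list of rows of grayscale values (0 = black, 255 = white).
    margin is the fraction of the area skipped on each side (an assumption).
    """
    dx = (x1 - x0) * margin
    dy = (y1 - y0) * margin
    x_start, y_start = int(x0 + dx), int(y0 + dy)
    # Always sample at least one pixel in each direction.
    xs = range(x_start, max(x_start + 1, int(x1 - dx)))
    ys = range(y_start, max(y_start + 1, int(y1 - dy)))
    total = sum(pixels[y][x] for y in ys for x in xs)
    average = total / (len(xs) * len(ys))
    return 1 if average < threshold else 0


def bits_to_bytes(bits):
    """Pack bits (most significant bit first) back into bytes."""
    out = bytearray()
    for i in range(0, len(bits) - len(bits) % 8, 8):
        value = 0
        for bit in bits[i:i + 8]:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)
```

Averaging makes a stray speck of dust or a faded spot harmless: one wrong pixel among many does not move the average across the threshold.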
This assignment will provide yet another test, a test of honesty. Will the student present his work with a reference to this CodeProject page or not? This is what a decent student should do.
—SA