Hi,
I would like to convert text files into hardcopy and then convert them back by scanning (not using regular OCR).
The best way I could think of is converting the data into some kind of code (like QR Code or Data Matrix) and spreading it over the A4 sheets. I don't want to use an OCX/DLL for these codes, or very big source files.

Is there an EASY way to create such a code that will do the job, assuming no rotation or scaling corrections are needed and no compression?

I need a simple algorithm.

Thanks.
Comments
Sergey Alexandrovich Kryukov 12-Oct-13 22:29pm    
Why? why?!
By the way, a "hardcopy" could be something like a bar code of the text. Would that suit you? (I see no sense in it anyway, but perhaps you can explain it to us.)
—SA
Kobi_Z 13-Oct-13 7:05am    
My English is bad; I hope you understand what I want:
It's a high school programming assignment for a friend's son.
After learning about media types (old media like punched paper sheets), they got an assignment to create their own paper-based medium. I suggested QR Code, but the conversion module has to be written by the students (something simple, but a bit more sophisticated than converting all the data to plain binary).
The hardcopy could be anything (no letters or numbers) printed on an A4 sheet and later scanned and converted back to the original text.

Thanks.
Sergey Alexandrovich Kryukov 13-Oct-13 13:51pm    
Thank you for the clarification. Now, it is clear enough to answer the question.
—SA

Thank you for the clarifications (see the comments to the question). Okay, I must say, the assignment is far from simple. I don't really believe in the adequacy of the teacher: if you can give such challenging assignments to high school students, don't teach stupidities like VB; teach some serious programming. (Before answering this question, I checked the information on the Israeli education system to get confirmation for myself: "high school" really means students of grades 10-12, typically 15 to 18 years old. That means people who have not started advanced mathematics, who have only learned school algebra and geometry, maybe the basics of calculus, without the level of rigor accepted in this field.)

The problem is quite solvable, but… The main goal of the solution is to make the recognition of the "hardcopy" image more or less reliable. Of course, printing the text and, hence, OCR, would be out of the question. I would apply something similar to a kind of bar code. Some library for the standard bar codes could be used, but I have no idea who would write such a library for "VB" or "Matlab". However, for many platforms and technologies it would be easy to find one. For .NET, in particular, one could find a lot of solutions.

So, I would try to create a very simplified graphical code. Something like this: 1) draw a thick rectangular black frame, to define the scale and aspect ratio of the coded area; 2) subdivide the area inside the frame into several rows, and each row into 8 * N areas, each representing a single byte; 3) subdivide the area of each byte into 8 areas representing bits; 4) serialize the text into a series of bytes and then bits and, for each set bit, paint the bit area black.
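The four steps above could be sketched like this (illustrative Python only; the function name `encode_text`, the value of N, and the convention 1 = black cell are my own choices, not part of any standard):

```python
# Sketch of the encoder: fixed layout, N bytes per row, each byte
# split into 8 bit cells, with a solid black frame around the grid.
# Cell values: 1 = black, 0 = white.

N = 4                      # bytes per row (a code convention)
CELLS_PER_ROW = 8 * N      # bit cells per data row

def encode_text(text):
    """Serialize text to bytes, then lay the bits out row by row."""
    data = text.encode("utf-8")
    bits = []
    for byte in data:
        for i in range(7, -1, -1):          # most significant bit first
            bits.append((byte >> i) & 1)
    while len(bits) % CELLS_PER_ROW:        # pad the last row with zeros
        bits.append(0)
    rows = [bits[i:i + CELLS_PER_ROW]
            for i in range(0, len(bits), CELLS_PER_ROW)]
    # surround the grid with a one-cell-thick frame of black cells
    framed = [[1] * (CELLS_PER_ROW + 2)]
    for row in rows:
        framed.append([1] + row + [1])
    framed.append([1] * (CELLS_PER_ROW + 2))
    return framed

grid = encode_text("Hi")   # 2 bytes -> one data row plus the frame
```

Rendering the grid is then just painting one filled rectangle per 1-cell at its precomputed position on the page.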

The recognition should first determine the rectangle of the code area, using the frame. This is the most challenging part. In fact, this part could be skipped: one could assume that the user of this software can take the scanner, align the page very accurately, pre-scan it, manually select the area of the code precisely, and finally scan only the rectangular area where the code is. But a quality algorithm should determine the code area automatically. I think that would be a very advanced requirement, and I would not insist on it (after all, the assignment does not demand it strictly). An even more challenging requirement would be handling the cases where the paper is not accurately aligned. Commercial software, to be successful, would certainly need this feature.

Now, as to the recognition of the bits: my idea is based on the fact that the location of each bit's rectangular area is known in advance from the size of the frame (or the total size of the code). The aspect ratio of the bit areas and the number of bytes per line (N, see above) should be used as preexisting knowledge, part of the code conventions. The number of rows (N-byte fragments) should be determined during recognition. One useful option would be to reserve the first row to declare the length of the coded text, in an unsigned byte (or other unsigned integer) representation.
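The length-header convention could be as simple as this (a hypothetical sketch; the names `length_header`/`read_length` and the big-endian byte order are my own assumptions):

```python
# Hypothetical convention: the first data row holds the text length
# as an unsigned integer spread over its N bytes, most significant
# byte first.

N = 4  # bytes per row, agreed between encoder and decoder

def length_header(text_length):
    """Pack the length into N bytes, big-endian."""
    return [(text_length >> (8 * (N - 1 - i))) & 0xFF for i in range(N)]

def read_length(header_bytes):
    """Reassemble the length from the decoded first row."""
    value = 0
    for b in header_bytes:
        value = (value << 8) | b
    return value
```

With the length known, the decoder can stop after exactly that many bytes and ignore the padding in the last row.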

This way, the recognition of each bit would not involve a search for its location. It would be enough just to determine the color. The decision should never be based on a single pixel: the solution should take a set of pixels inside the expected bit area, average them, and tell white from black. For example, with one byte per pixel in a grayscale bitmap, one should compare the average pixel value with 128. If it is more than that, the color is white (the bit is cleared, 0); otherwise it is black (the bit is set, 1, matching the painting convention above), or vice versa.
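The averaging-and-thresholding step might look like this (a sketch under the convention that black means a set bit; `read_bit` and `read_byte` are illustrative names, and grayscale is assumed to run 0 = black to 255 = white):

```python
# Decide each bit by averaging a patch of grayscale samples taken
# inside the expected cell area and comparing the mean with 128.

def read_bit(pixels):
    """pixels: grayscale values (0-255) sampled inside one bit cell."""
    mean = sum(pixels) / len(pixels)
    return 1 if mean < 128 else 0   # dark cell -> bit set

def read_byte(patches):
    """patches: 8 pixel patches, most significant bit first."""
    value = 0
    for patch in patches:
        value = (value << 1) | read_bit(patch)
    return value
```

Averaging over the patch is what makes the decision robust against printer dithering, scanner noise, and slight misalignment of a cell boundary.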

This assignment will provide yet another test: a test of honesty. Will the student present his work with a reference to this CodeProject page, or not? That is what a decent student should do.

—SA
 
 
I have read your clarification and Sergey's answer. Here is my addition.
First of all, let's take a look at QR Code, Data Matrix, or any other 1D or 2D barcode[^]. They have been built to fulfill business needs, to fit particular situations in a business process. All have some sort of markers that help with finding, identifying, and scaling them, and thus with decoding them.
You need these two things: something you can use for alignment (a finder) and an encoding method.
In your case, since you scan a whole page, it is straightforward to use some 2D code, and you can rely on the quality of the reading, so you don't need special considerations. Let's assume you are encoding a byte stream; the encoding itself is then not a question.
The alignment markers can be used to rectify the page if it was not scanned absolutely precisely, and to get an idea of the scale when reading.
I would use something like the Braille pattern, but with 3x3 "dots" per byte. That lets you have a 1-bit parity code for quality purposes. "White and black" is only one option for the bits; you can also use cell density (gray scale) ranges, like TTL levels. That is more fail-safe. For example: 0-5%: not data, empty area of the page; 5-20%: zero; 20-60%: error zone; 60-75%: one; 75-85%: error zone; 85-100%: reserved for markers.
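The 3x3 cell with a parity bit could be sketched like this (illustrative Python; the cell layout, the even-parity choice, and the names `byte_to_cell`/`cell_to_byte` are my own assumptions):

```python
# One byte per 3x3 block: 8 data bits laid out row-major, plus an
# even-parity bit over the data in the last position.

def byte_to_cell(byte):
    """Return a 3x3 block of bits for one byte, MSB first."""
    bits = [(byte >> (7 - i)) & 1 for i in range(8)]
    bits.append(sum(bits) % 2)          # even parity over the data bits
    return [bits[0:3], bits[3:6], bits[6:9]]

def cell_to_byte(cell):
    """Decode a 3x3 block; raise if the parity check fails."""
    bits = cell[0] + cell[1] + cell[2]
    if sum(bits[:8]) % 2 != bits[8]:
        raise ValueError("parity error")
    value = 0
    for b in bits[:8]:
        value = (value << 1) | b
    return value
```

A single parity bit only detects an odd number of flipped cells in the block; it does not correct anything, but it lets the reader flag a damaged byte instead of silently decoding garbage.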
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


