I didn't think about using a typedef. I understand how to use in-line assembly, but without the typedef, trying to do
fld tbyte ptr var
Thank you for the answer.
I have rewritten ENT in MASM to try to speed it up for huge files. I first modified ENT to handle huge files, but it was terribly slow: my 16 GB file took 32 minutes to process. I have cut that down to 10 minutes with my entmasm, and I am trying to validate the calculations by using printf statements in both programs to display all of the calculation inputs and outputs (for a VERY SHORT 10 BYTE file), but in the entropy calculations the results do not quite match. The C compiler (Visual Studio 2008) is optimizing the code and keeps most of the intermediate results on the FPU stack in temporary real format. My MASM code was pretty brutal and kept each calculation separate, saving results in memory. I was using doubles (real8) for storage, and I think that the loss of precision when saving an FPU value as a real8 is what caused the result differences. I will convert my MASM code to use tbytes and modify the ENT C code to do the same, and see if that makes the results match.
One other question. ENT has several 256-entry arrays for the counts and probabilities. If I change these to arrays of 256 tbytes (10 BYTES each), will the mod 16 misalignment cause performance problems? Would it be better to make them arrays of owords (16 BYTES) and just index by 16 instead of by 10? I am not worried about the extra memory; I have GIGABYTES of unused memory.
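A rough way to put numbers on the stride question is to count how many array entries would cross a cache-line boundary under each layout. This is only a sketch: the helper name straddlers is made up for illustration, the 64-byte line size is an assumption about the CPU, and the base address is assumed to be line-aligned.

```c
#include <assert.h>

/* Hypothetical helper: count how many of `count` elements, each `size`
   bytes long and placed `stride` bytes apart from an aligned base,
   cross a `line`-byte boundary. */
static unsigned straddlers(unsigned count, unsigned size,
                           unsigned stride, unsigned line)
{
    unsigned i, n = 0;

    assert(size != 0 && line != 0);
    for (i = 0; i < count; i++) {
        unsigned first = i * stride;       /* offset of first byte */
        unsigned last  = first + size - 1; /* offset of last byte  */
        if (first / line != last / line)   /* starts and ends in different lines */
            n++;
    }
    return n;
}
```

Under those assumptions, straddlers(256, 10, 10, 64) comes out to 32 and straddlers(256, 10, 16, 64) to 0, so padding each tbyte to a 16-byte slot removes every split access at the cost of 1536 extra bytes per array.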
I have already implemented the DQWORD arrays for my MASM version, and am adding printf statements for the intermediate calculations. The change to use TBYTES did fix the differences I had seen in the calculated ENT value between the C version and my MASM version. This was not as simple as it looked at first. The only FPU instructions that can take a TBYTE real operand are FLD and FSTP (FST has no 80-bit memory form). You cannot use FADD TBYTE PTR [i], but I can see where the xxxP FPU instructions come from:
It turns out that this is exactly what the FPU needs to do for an fadd val: it must push the stack, load the double/float/integer into ST(0) and convert it to temporary real, then add/sub/mul/div ST(1) by ST(0) and put the result into ST(1), then pop the stack, leaving the result in ST(0). With TBYTES you just have to do it manually, BUT the FPU doesn't have to do any conversions: the TBYTE is already in temporary real format and can contain a signed QWORD (63 bits + sign). Unfortunately, the FPU's integer loads treat a QWORD as signed, so unsigned 64-bit values with the top bit set end up as negative values.
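One common workaround for the signed-only integer load is to convert the value as signed and then add 2^64 back when the result came out negative. The C sketch below mirrors that idea; the function name u64_to_ld is hypothetical, and it assumes the usual two's-complement wrap when a large uint64_t is cast to int64_t (implementation-defined in C). Where long double is the 80-bit temporary real format the result is exact; with compilers that map long double to a plain double (Visual Studio does), the low bits of very large values will round.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert an unsigned 64-bit count to long double
   using only a signed conversion, the way x87 code has to. */
static long double u64_to_ld(uint64_t u)
{
    /* Assumption: the cast wraps, so values >= 2^63 become negative. */
    int64_t s = (int64_t)u;
    long double x = (long double)s;        /* signed load, like FILD qword */

    if (s < 0)
        x += 18446744073709551616.0L;      /* add 2^64 to undo the wrap */

    assert(x >= 0.0L);                     /* an unsigned count is never negative */
    return x;
}
```

In MASM the same fix would be a test of the sign bit after the load followed by an add of a 2^64 constant, which costs very little when it only runs once per bin at the end of the file.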
The biggest improvement I got was changing from fgetc for each character to reading 65536 bytes into a buffer (with no system buffering) and indexing through the BYTES, then reading more from the file into the buffer and processing that. Another interesting change was to fill the buffer, initialize one time for the FIRST character, then jump straight into processing the characters, skipping around the entry point used for subsequent buffer refills. It is very little extra code, BUT it avoids checking whether this is the first character 16 billion times, as was done in ENT.
Another speedup was to split the character occurrence counting into two levels. I had to grow the collection bins to a QWORD to support a maximum-size file (2^63 BYTES), but in BIT mode this increased the count to 2^66. I had to accumulate the counts in three DWORDS in a DQWORD, using "add value, adc 0, adc 0", and this occurred (in my 16 GB test) 16 billion times. With a smaller per-buffer count that fits in a DWORD, it is just an "add value" for each character, then 256 iterations of "add value, adc 0, adc 0" to accumulate the 256 occurrence values into the DQWORD array and clear the DWORD counts between buffer fills.
Also, for checking whether a bin had any count at all (several places in ENT checked this), I accumulated (at the end of the file processing) the 3 DWORDS for each entry into the fourth DWORD, which can be tested with a single instruction.
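For anyone who wants to try the buffered, two-level counting in C before going to assembly, here is a minimal sketch of the idea. It is not the entmasm code: the name count_bytes and the CHUNK macro are made up, it uses plain fread (whether to disable stdio buffering with setvbuf is left to the caller), and it assumes 64-bit totals are enough, which holds for byte mode but not for the 2^66 bit counts mentioned above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CHUNK 65536u   /* buffer size taken from the post */

/* Hypothetical helper: count byte occurrences in `in`.
   Fills totals[0..255] and returns the number of bytes read. */
static uint64_t count_bytes(FILE *in, uint64_t totals[256])
{
    static unsigned char buf[CHUNK];   /* static, so not reentrant */
    uint32_t bins[256];                /* per-buffer counts, max 65536 each */
    uint64_t total = 0;
    size_t got, i;

    assert(in != NULL && totals != NULL);
    memset(totals, 0, 256 * sizeof totals[0]);

    while ((got = fread(buf, 1, CHUNK, in)) > 0) {
        memset(bins, 0, sizeof bins);
        for (i = 0; i < got; i++)
            bins[buf[i]]++;            /* hot loop: one narrow add per byte */
        for (i = 0; i < 256; i++)
            totals[i] += bins[i];      /* 256 wide adds per buffer fill */
        total += got;
    }
    return total;
}
```

Because a buffer holds at most 65536 bytes, a 32-bit bin cannot overflow between flushes, so the wide arithmetic runs 256 times per buffer instead of once per byte.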
But I digress from a simple C question into something more appropriate for Algorithms.