## Introduction

This program converts a floating point number to its byte representation, and converts a byte representation back to a floating point number.

## Background

Have you ever tried to develop a program to read a DLIS (Digital Log Interchange Standard) data file? I found that the sample log data was recorded in VAX single float format, so I had to read a 4-byte stream from the binary file and then recover the real number. I succeeded in recovering all the frame data for all the channels. I compared my result with the output of Toolbox, the free Schlumberger tool, and they were identical. I also ran some tests using a free Java package, Cynosurex, and it gave the same result. This reminded me of some half-finished work I did on floating point numbers and byte order analysis five years ago, and it rekindled my enthusiasm to continue.

## Bits expression of floating point number

IEEE single precision floating point:

- SEF: S EEEEEEEE FFFFFFF FFFFFFFF FFFFFFFF
- bits: S is bit 1, E is bits 2 to 9, F is bits 10 to 32
- bytes: byte1 byte2 byte3 byte4

IEEE double precision floating point:

- SEF: S EEEEEEE EEEE FFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF
- bits: S is bit 1, E is bits 2 to 12, F is bits 13 to 64
- bytes: byte1 byte2 byte3 byte4 byte5 byte6 byte7 byte8
- fraction parts: L1 L2

IBM single precision floating point:

- SEF: S EEEEEEE FFFFFFFF FFFFFFFF FFFFFFFF
- bits: S is bit 1, E is bits 2 to 8, F is bits 9 to 32
- bytes: byte1 byte2 byte3 byte4

IBM double precision floating point:

- SEF: S EEEEEEE FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF
- bits: S is bit 1, E is bits 2 to 8, F is bits 9 to 64
- bytes: byte1 byte2 byte3 byte4 byte5 byte6 byte7 byte8
- fraction parts: L1 L2

VAX single precision floating point:

- SEF: S EEEEEEEE FFFFFFF FFFFFFFF FFFFFFFF
- bits: S is bit 1, E is bits 2 to 9, F is bits 10 to 32
- bytes: byte2 byte1 byte4 byte3

## General encoding formula of the floating point

V = (-1)<SUP>S</SUP> * M * A<SUP>( E - B )</SUP>
M = C + F

V is the value, S is the sign, M is the mantissa, A is the base, E is the exponent, B is the exponent bias, C is the mantissa constant, and F is the fraction. A, B and C are constants that depend on the floating point architecture. Here are some of them:

| Format | A | B | C |
| --- | --- | --- | --- |
| IEEE single float | 2 | 127 | 1 |
| IEEE double float | 2 | 1023 | 1 |
| IBM single float | 16 | 64 | 0 |
| IBM double float | 16 | 64 | 0 |
| VAX single float | 2 | 128 | 0.5 |
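As a worked example, the formula and constants above decompose the same number differently in each architecture. The value 6.5 is chosen here purely for illustration and is not taken from the test programs:

```latex
\begin{align*}
\text{IEEE single:}\quad 6.5 &= (-1)^{0}\,(1 + 0.625)\cdot 2^{129-127}, & S &= 0,\ E = 129,\ F = 0.625 \\
\text{IBM single:}\quad 6.5 &= (-1)^{0}\,(0 + 0.40625)\cdot 16^{65-64}, & S &= 0,\ E = 65,\ F = 0.40625 \\
\text{VAX single:}\quad 6.5 &= (-1)^{0}\,(0.5 + 0.3125)\cdot 2^{131-128}, & S &= 0,\ E = 131,\ F = 0.3125
\end{align*}
```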

## Maximum value of the fraction

As mentioned above, F is the fraction. The minimum value of the IEEE and VAX fraction F is 0, and the minimum value of the IBM fraction is 1/16. F is zero when all fraction bits (the F bits in the layouts above) are 0, and it reaches its maximum when all fraction bits are 1. To figure out that maximum, we need a little high school mathematics. Do you remember this formula?

1/2 + 1/4 + 1/8 + ... + 1/2<SUP>n</SUP> = 1 - 1/2<SUP>n</SUP>

The one easily overlooked detail here concerns the VAX single precision floating point. Besides its weird byte order, its fraction bits start at 1/4, not at 1/2 as in IEEE or IBM. This is another example of how complexity always comes from individuality.

G is the maximum value of the fraction F:

- IEEE single float: G = 1 - 1/2<SUP>23</SUP>
- IEEE double float: G = 1 - 1/2<SUP>52</SUP>
- IBM single float: G = 1 - 1/2<SUP>24</SUP>
- IBM double float: G = 1 - 1/2<SUP>56</SUP>
- VAX single float: G = 1 - 1/2<SUP>24</SUP> - 1/2
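The VAX entry follows from the same geometric sum once the missing 1/2 term is removed. A short derivation, assuming the 23 VAX fraction bits carry the weights 1/4 through 1/2<SUP>24</SUP> as described above, and using C = 1/2 from the constants:

```latex
\begin{align*}
G_{\mathrm{VAX}} &= \sum_{k=2}^{24} \frac{1}{2^{k}}
  = \Bigl(1 - \frac{1}{2^{24}}\Bigr) - \frac{1}{2}
  = \frac{1}{2} - \frac{1}{2^{24}}, \\
C + G_{\mathrm{VAX}} &= \frac{1}{2} + \frac{1}{2} - \frac{1}{2^{24}}
  = 1 - \frac{1}{2^{24}}.
\end{align*}
```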

## Mantissa range

It is easy to figure out the mantissa range based on the above values of C and G. The IBM float mantissa minimum value will be explained below.

- IEEE single float: 1 <= M <= 2 - 1/2<SUP>23</SUP>
- IEEE double float: 1 <= M <= 2 - 1/2<SUP>52</SUP>
- IBM single float: 1/16 <= M <= 1 - 1/2<SUP>24</SUP>
- IBM double float: 1/16 <= M <= 1 - 1/2<SUP>56</SUP>
- VAX single float: 1/2 <= M <= 1 - 1/2<SUP>24</SUP>

## Bytes order

I use a simple union data structure and a two-byte unsigned short integer with the value 258 to find the byte order your machine uses to store numbers in memory. Since 258 is 0x0102, the first byte in memory is the low byte 0x02 on a little endian architecture such as Intel, so the function returns 2; on a big endian architecture such as SPARC, the first byte is the high byte 0x01, so the function returns 1.
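The probe described here can be sketched in a few lines of C. This is a minimal illustration rather than the code from the library; the function name `byte_order_probe` is hypothetical, and the sketch assumes `unsigned short` is exactly two bytes wide.

```c
/* Hypothetical helper, not the library's own function.
   Assumes unsigned short occupies exactly two bytes. */
int byte_order_probe(void)
{
    union {
        unsigned short value;
        unsigned char  bytes[sizeof(unsigned short)];
    } probe;

    probe.value = 258;        /* 258 == 0x0102: high byte 1, low byte 2 */
    return probe.bytes[0];    /* first byte in memory: 2 = little endian, 1 = big endian */
}
```

A caller would compare the result against 2 to decide whether incoming big endian data needs to be byte-swapped before decoding.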

## Transform bytes to floating point

There are two steps for the transformation. The first step is to transform bytes to SEF, which means Sign, Exponent, and Fraction. The second step is to transform SEF to a floating point number.

### 1. Bytes to SEF:

First, adjust the incoming byte order to fit the bit layout given above; the SEF values can then be extracted with a few bit operations based on that layout. For double precision floating point, I decompose the fraction into two parts, the unsigned long integers L1 and L2.
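For the single precision formats, this step might look like the sketch below. The type `Sef32` and all function names are hypothetical, the fraction is kept as raw bits rather than as a real number, and the input array is assumed to hold the bytes exactly as they appear in the file.

```c
#include <stdint.h>

/* Hypothetical container for the three raw fields. */
typedef struct {
    unsigned sign;      /* S: 0 or 1 */
    unsigned exponent;  /* E: biased exponent */
    uint32_t fraction;  /* F: raw fraction bits, not yet scaled */
} Sef32;

/* Assemble byte1..byte4 (most significant first) into one 32-bit word. */
static uint32_t load_word(const unsigned char b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* IEEE single: 1 sign bit, 8 exponent bits, 23 fraction bits. */
Sef32 ieee_bytes_to_sef(const unsigned char b[4])
{
    uint32_t w = load_word(b);
    Sef32 r;
    r.sign     = (unsigned)(w >> 31);
    r.exponent = (unsigned)((w >> 23) & 0xFF);
    r.fraction = w & 0x7FFFFF;
    return r;
}

/* IBM single: 1 sign bit, 7 exponent bits, 24 fraction bits. */
Sef32 ibm_bytes_to_sef(const unsigned char b[4])
{
    uint32_t w = load_word(b);
    Sef32 r;
    r.sign     = (unsigned)(w >> 31);
    r.exponent = (unsigned)((w >> 24) & 0x7F);
    r.fraction = w & 0xFFFFFF;
    return r;
}

/* VAX single: stored as byte2 byte1 byte4 byte3, so swap each pair,
   then split the word with the same 1/8/23 bit layout. */
Sef32 vax_bytes_to_sef(const unsigned char b[4])
{
    unsigned char t[4];
    t[0] = b[1]; t[1] = b[0]; t[2] = b[3]; t[3] = b[2];
    return ieee_bytes_to_sef(t);
}
```

With these definitions, the stored VAX bytes D0 41 00 00 yield S = 0, E = 131 and raw fraction 0x500000, which corresponds to the value 6.5.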

### 2. SEF to floating point:

It is easy to recover the floating point number from SEF based on the above general encoding formula and the three constants A, B and C.
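A direct transcription of the encoding formula is sketched below. The function names are hypothetical, and the sketch deliberately leaves out special cases such as true zero, which the formula does not describe. The power of the base is built with a loop so that the same code serves base 2 and base 16.

```c
#include <stdint.h>

/* Scale raw fraction bits to the real value F by dividing by 2^shift.
   shift = 23 for IEEE single; 24 for IBM single and for VAX single
   (the VAX fraction starts at 1/4, so its 23 bits scale by 2^24). */
double fraction_to_real(uint32_t raw, int shift)
{
    return (double)raw / (double)((uint32_t)1 << shift);
}

/* V = (-1)^S * (C + F) * A^(E - B).
   s, e, f are the decoded fields; a, b, c are the format constants. */
double sef_to_value(int s, int e, double f, double a, int b, double c)
{
    double v = c + f;               /* mantissa M = C + F */
    int n = e - b;                  /* unbiased exponent */

    while (n > 0) { v *= a; n--; }  /* multiply for positive exponents */
    while (n < 0) { v /= a; n++; }  /* divide for negative exponents */

    return s ? -v : v;
}
```

For instance, `sef_to_value(0, 65, 0.40625, 16.0, 64, 0.0)` applies the IBM constants and gives 6.5.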

## Transform floating point to bytes

As with the method above, the transformation takes two steps. The first step is to transform the floating point number to SEF. The second step is to transform SEF to bytes.

### 1. Floating point to SEF:

This is the most important part of the whole program. I developed two methods to calculate E and F from the floating point number.

The first method is the more natural one. Its principle is the same as converting an integer to its binary representation, where every bit is obtained by repeatedly dividing by the base 2. In our case, we repeatedly divide or multiply by the base until the result settles within the mantissa range given above. Whether to divide or multiply depends on whether E-B is positive or negative, and that sign cannot be known before E is known. However, we can determine it by comparing the floating point number with the mantissa bounds. The number of loop iterations determines E, and the value left over after the loop determines F.
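The loop can be sketched as follows. The names `SefValue` and `value_to_sef_loop` are hypothetical, F is returned as a real number rather than as raw bits, and returning E = 0 and F = 0 for a zero input is an assumption of this sketch. The parameter `m_min` is the lower mantissa bound: 1 for IEEE, 1/16 for IBM and 1/2 for VAX.

```c
/* Hypothetical result type: sign, biased exponent, real-valued fraction. */
typedef struct {
    int    s;
    int    e;
    double f;
} SefValue;

/* Scale |v| by the base a until it lies in [m_min, m_min * a),
   counting the steps. a, b, c are the format constants A, B, C. */
SefValue value_to_sef_loop(double v, double a, int b, double c, double m_min)
{
    SefValue r;
    double m = (v < 0.0) ? -v : v;   /* work on the magnitude */
    int n = 0;                       /* n plays the role of E - B */

    r.s = (v < 0.0) ? 1 : 0;
    if (m == 0.0) {                  /* assumption: zero maps to E = 0, F = 0 */
        r.e = 0;
        r.f = 0.0;
        return r;
    }

    while (m >= m_min * a) { m /= a; n++; }  /* too large: E - B is positive */
    while (m < m_min)      { m *= a; n--; }  /* too small: E - B is negative */

    r.e = n + b;                     /* loop count gives the exponent */
    r.f = m - c;                     /* what is left gives the fraction */
    return r;
}
```

Comparing the magnitude against the two bounds is what decides between dividing and multiplying, so the sign of E-B never has to be known in advance.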

The second method is a little more complex. It works in reverse: it computes E with a logarithm, after which F follows easily from the encoding formula above. We can derive the following formulas for E and F:

V is the floating point number, and D = log 2, where log is the natural logarithm (base e):

- IEEE single float: E = (int) ( logV / D + B )
- IEEE double float: E = (int) ( logV / D + B )
- IBM single float: E = (int) ( ( logV / D ) / 4 + 1 + B )
- IBM double float: E = (int) ( ( logV / D ) / 4 + 1 + B )
- VAX single float: E = (int) ( ( logV / D ) + 1 + B )
- All formats: F = V / A<SUP>(E-B)</SUP> - C

I will give a brief proof of these formulae. The zero value and the sign S can safely be ignored, so the float value is assumed to be positive: V > 0.

- IEEE float:
The mantissa range of the IEEE float is: 1 <= M < 2 . So: 0 <= logM / log2 < 1. Notice: E is a non-negative integer, so:

(int) ( logV / log2 + B )
= (int) ( logM / log2 + ( E-B ) + B )
= (int) ( logM / log2 + E )
= E

- IBM float:
The key point here is stated in the RP66 reference document: *Bits 1 - 4 of byte 2 may not all be zero except for true zero. In other words, the first hexadecimal digit of the mantissa must be non-zero, except for true zero*. This means the minimum value of the IBM float mantissa is 1/16, so 1/16 <= M < 1 and 0 <= logM / log16 + 1 < 1. Notice: E is a non-negative integer.

(int) ( ( logV / log2 ) / 4 + 1 + B )
= (int) ( logM / log16 + ( E-B ) + 1 + B )
= (int) ( logM / log16 + 1 + E )
= E

- VAX float:
The minimum value of the VAX float mantissa is 1/2, so 1/2 <= M < 1 and 0 <= logM / log2 + 1 < 1. Notice: E is a non-negative integer.

(int) ( logV / log2 + 1 + B )
= (int) ( logM / log2 + ( E-B ) + 1 + B )
= (int) ( logM / log2 + 1 + E )
= E

### 2. SEF to bytes

Nothing is difficult in this part. It just needs a byte order adjustment and a few bit operations.
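A minimal sketch of this step for the IEEE and VAX single formats follows. The function names are hypothetical, and the fraction is assumed to be already converted to raw bits (for IEEE single, the real value F multiplied by 2<SUP>23</SUP>).

```c
#include <stdint.h>

/* Pack S, E and the raw 23-bit fraction into byte1..byte4,
   most significant byte first. */
void sef_to_ieee_bytes(unsigned s, unsigned e, uint32_t f, unsigned char out[4])
{
    uint32_t w = ((uint32_t)(s & 1) << 31) |
                 ((uint32_t)(e & 0xFF) << 23) |
                 (f & 0x7FFFFF);

    out[0] = (unsigned char)(w >> 24);
    out[1] = (unsigned char)(w >> 16);
    out[2] = (unsigned char)(w >> 8);
    out[3] = (unsigned char)w;
}

/* VAX single shares the 1/8/23 bit layout but stores the bytes
   as byte2 byte1 byte4 byte3, so swap each pair after packing. */
void sef_to_vax_bytes(unsigned s, unsigned e, uint32_t f, unsigned char out[4])
{
    unsigned char t[4];
    sef_to_ieee_bytes(s, e, f, t);
    out[0] = t[1];
    out[1] = t[0];
    out[2] = t[3];
    out[3] = t[2];
}
```

Packing S = 0, E = 131 and raw fraction 0x500000 with the VAX variant produces the stored bytes D0 41 00 00.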

The programs also include a conventional union-based method to convert between an IEEE float and its byte representation.
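A union-based conversion could look like the following sketch. The names are hypothetical, and it assumes the host `float` is a 4-byte IEEE single, which holds on common platforms but is not guaranteed by the C standard. The bytes come out in the host's own byte order.

```c
/* Overlay a float with its bytes. Assumes a 4-byte IEEE single on the host. */
typedef union {
    float         value;
    unsigned char bytes[sizeof(float)];
} FloatBytes;

/* Copy the bytes of v, in host byte order, into out. */
void float_to_host_bytes(float v, unsigned char out[sizeof(float)])
{
    FloatBytes u;
    unsigned i;

    u.value = v;
    for (i = 0; i < sizeof(float); i++)
        out[i] = u.bytes[i];
}

/* Rebuild the float from bytes given in host byte order. */
float host_bytes_to_float(const unsigned char in[sizeof(float)])
{
    FloatBytes u;
    unsigned i;

    for (i = 0; i < sizeof(float); i++)
        u.bytes[i] = in[i];
    return u.value;
}
```

Because the result depends on the host byte order, the bytes usually need reordering before they are compared with the byte1 to byte4 layout shown earlier.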

## Compile

This program was compiled in the MinGW environment on Windows XP. Before compiling, you have to set the PATH environment variable or run *setp.bat*.

set PATH=%PATH%;C:\MinGW\bin

Then run *clib.bat* to create the library, or manually compile the program as follows:

del libNumber.lib
del *.o
g++ -c ByteOrder.c
g++ -c Float2SEF.c
g++ -c SEF2Byte.c
g++ -c Byte2SEF.c
g++ -c SEF2Float.c
g++ -c IeeeFloat.c
g++ -c IbmFloat.c
g++ -c VaxFloat.c
g++ -c TestlibNumber.c
ar m libNumber.lib
ar r libNumber.lib *.o
ar t libNumber.lib
del *.o

Run *cpsam.bat* to compile the two test programs as follows:

cpsam test1
cpsam test2

Afterwards, you can run the *test1.exe* program in a DOS window. You can also redirect the output to a text file as follows:

test1 > test.txt

*test2.exe* is another test program, which exercises this library with any float number.

## Reference

I have wrapped all reference web pages into my source code Zip package.