
Interpreting Intel 80-bit Long Double Byte Arrays

Nathan Baulch, 5 Apr 2004
A simple BitConverter class that is capable of reading and writing Intel 80-bit long doubles.

Abstract

I recently found myself needing to read Intel 80-bit long doubles from a binary stream whilst integrating with another system. For cross-platform compatibility reasons, Microsoft decided not to include a long double (a.k.a. extended) type in the framework, and thus I was forced to interpret the bytes myself.

This is not an implementation of a long double type, but a BitConverter class that translates between long double byte arrays and regular doubles. Naturally, a certain amount of range and precision is lost during the process; however, for my purposes, this is acceptable.

Introduction

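A long double is a floating point number that is 16 bits bigger than a regular double: 80 bits instead of 64. The additional bits increase both the range (a 15-bit exponent instead of an 11-bit one) and the precision (a 64-bit significand instead of 53 bits, counting the double's implicit leading bit) of the number, which is why the type is mostly used for mathematical and scientific calculations.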

This article describes how to read a long double byte array and create a regular double value from it. The inverse operation is simply a matter of reversing the procedure and will not be covered here.

Problem

The first step is to identify the different kinds of long doubles and their equivalent regular doubles. The rules that define each of these states can be found in the IEEE Standard 754 specification (see References).

Long Double          Double Equivalent
-----------          -----------------
Unsupported          Unsupported
Normal               Zero, Subnormal, Normal or Overflow (depends on e)
Subnormal            Zero
Pseudo-Denormal      Zero
Signed Zero          Zero
Positive Infinity    Positive Infinity
Negative Infinity    Negative Infinity
Quiet NaN            NaN
Signaling NaN        NaN

There are three main kinds of long doubles: normal, subnormal and pseudo-denormal. Each of these represents a range of numbers at a different scale: subnormals are the smallest, followed by pseudo-denormals (which overlap the very bottom of the normal range), then normals. Since the entire range of a regular double fits well inside the range of a normal long double, subnormal and pseudo-denormal numbers must round to zero.

The internal bytes of a long double consist of four parts (or fields): a sign bit (s, 1 bit), a biased exponent (e, 15 bits), a significand bit (j, 1 bit) and a fraction (f, 63 bits). Doubles are arranged in a similar way except that they don't have a significand bit and the exponent and fraction fields are smaller (11 and 52 bits respectively). The following diagram illustrates these differences and shows how the fields are translated.

[Figure: Field layouts and the translation between them.]

Now you're probably wondering what j is and what happens to it when translating to a double. I found it was of little significance and was only used when detecting the unsupported state, so I won't go into it here.
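
As a rough illustration of the four fields described above, the following sketch pulls s, e, j and f out of a 10-byte array. The class and method names are hypothetical (they are not taken from the downloadable code), and it assumes the bytes are in little-endian order, as Intel processors store them.

```csharp
using System;

// Hypothetical helper for illustration only; not the article's actual class.
public static class LongDoubleFields
{
    // Splits 10 little-endian bytes into the sign (s), biased exponent (e),
    // significand bit (j) and fraction (f).
    public static void Extract(byte[] bytes, int startIndex,
        out bool s, out int e, out bool j, out ulong f)
    {
        if (bytes == null) throw new ArgumentNullException(nameof(bytes));
        if (startIndex < 0 || bytes.Length - startIndex < 10)
            throw new ArgumentException("Need 10 bytes from startIndex.");

        // Bytes 0..7 hold the 64-bit significand: j in the top bit, f below it.
        ulong significand = 0;
        for (int i = 7; i >= 0; i--)
            significand = (significand << 8) | bytes[startIndex + i];

        // Bytes 8..9 hold the 15-bit exponent with the sign in the top bit.
        int signAndExponent = bytes[startIndex + 8] | (bytes[startIndex + 9] << 8);

        s = (signAndExponent & 0x8000) != 0;
        e = signAndExponent & 0x7FFF;
        j = (significand & 0x8000000000000000UL) != 0;
        f = significand & 0x7FFFFFFFFFFFFFFFUL;
    }
}
```

For the bytes 00 00 00 00 00 00 00 80 FF 3F (the long double 1.0), this yields s = false, e = 0x3FFF, j = true and f = 0.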

Solution

So, now the problem can be solved by determining the type of long double and then translating e and f (if it happens to be a normal number).
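
Determining the type can be sketched as a small classification function over e, j and f. The enum and method names below are my own invention for illustration, and the rules reflect my reading of the x87 extended format, so treat this as a sketch rather than the article's actual code. The sign bit is left out because it only decides whether a zero or infinity is positive or negative.

```csharp
using System;

// Hypothetical names, for illustration only.
public enum LongDoubleKind
{
    Unsupported, Normal, Subnormal, PseudoDenormal,
    Zero, Infinity, QuietNaN, SignalingNaN
}

public static class LongDoubleClassifier
{
    // e: 15-bit biased exponent, j: significand bit, f: 63-bit fraction.
    public static LongDoubleKind Classify(int e, bool j, ulong f)
    {
        if (e == 0)
        {
            // Smallest exponent: zero, subnormal or pseudo-denormal.
            if (j) return LongDoubleKind.PseudoDenormal;
            return f == 0 ? LongDoubleKind.Zero : LongDoubleKind.Subnormal;
        }

        // Any non-zero exponent with j clear is an unsupported encoding.
        if (!j) return LongDoubleKind.Unsupported;

        if (e != 0x7FFF) return LongDoubleKind.Normal;

        // Largest exponent: infinity or NaN.
        if (f == 0) return LongDoubleKind.Infinity;

        // The top fraction bit (bit 62) separates quiet from signaling NaNs.
        return (f & 0x4000000000000000UL) != 0
            ? LongDoubleKind.QuietNaN
            : LongDoubleKind.SignalingNaN;
    }
}
```

Only the Normal case needs the e and f translation that follows; every other kind maps directly to a fixed double value or to an error.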

Starting with f, the following code translates it into a value suitable for a double.

f >>= 11;

It is shifted right to suit the 11 fewer bits that are available in the double's fraction field (63 - 52 = 11); the discarded low bits are simply truncated. Next, we translate e.

e -= (0x3FFF - 0x3FF);

Since e is biased, taking it from 15 bits to 11 bits involves subtracting the original bias ($2^{14} - 1$) and adding the new one ($2^{10} - 1$).
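
Written out as arithmetic, the rebias step looks like this. The subscripted names are only labels I am using here for the long double's and the double's exponent fields.

```latex
\begin{align*}
e_{\text{double}} &= e_{\text{ext}} - (2^{14} - 1) + (2^{10} - 1) \\
                  &= e_{\text{ext}} - 16383 + 1023 \\
                  &= e_{\text{ext}} - 15360
\end{align*}
% Example: the value 1.0 has e_ext = 0x3FFF = 16383,
% which becomes e_double = 16383 - 15360 = 1023 = 0x3FF.
```

The constant 15360 is the same value as (0x3FFF - 0x3FF) in the code line above.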

Clearly e can still fall outside the range of a normal double's exponent field after this translation. If e is 0x7FF or higher, then the number is too big and cannot be represented by a double. An OverflowException is thrown in this case. If e is 0 or less, then the number is too small to be a normal double and would otherwise be rounded to zero.

However, if e is no less than -51, it can be salvaged by translating it into a subnormal double using some careful bit manipulation. The following code does just this:

if(e >= 0x7FF) //outside the range of a double
    throw new OverflowException();
else if(e < -51) //too small to translate into subnormal
    return 0;
else if(e <= 0) //big enough to translate into subnormal (a normal double needs e > 0)
{
    f |= 0x10000000000000; //make the implicit leading 1 explicit (bit 52)
    f >>= (1 - e); //halve once for each step e falls short of 1
    e = 0; //an exponent field of 0 marks a subnormal
}

To understand the above translation, it is important to understand the mathematical representation of a normal double. The following is a (simplified) definition.

$2^{e-1023} \times 1.f, \quad e > 0$

The more we reduce e below 1023, the more times we end up halving the resulting number. In this case, we have tried to reduce e past 0, which is not allowed. Another way to reduce the result is to halve the f component instead, that is, to bit shift it to the right. Before shifting, the implicit leading 1 of 1.f has to be written into bit 52, because a subnormal double has no implicit 1. Since the fraction field of a double is 52 bits wide, 52 becomes the maximum number of shifts we can do before we're left with zero.
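
To make the shifting concrete, here is a minimal sketch of the subnormal step as a standalone function. The class and method names are hypothetical. It assumes f has already been shifted right by 11 and e has already been rebiased, and it treats e == 0 as subnormal too, since a double with an exponent field of 0 has no implicit leading 1.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class SubnormalSketch
{
    // e:   rebiased exponent, expected to be 0 or less.
    // f52: 52-bit fraction (already shifted right by 11).
    // Returns the fraction field of the subnormal double; its exponent field is 0.
    public static ulong ToSubnormalFraction(int e, ulong f52)
    {
        if (e > 0) throw new ArgumentOutOfRangeException(nameof(e));
        if (e < -51) return 0;            // every bit would be shifted out

        f52 |= 0x10000000000000UL;        // explicit leading 1 at bit 52
        return f52 >> (1 - e);            // shift count is between 1 and 52
    }
}
```

With e = -51 and a zero fraction, only the leading 1 survives, landing in bit 0; that is the smallest positive double (double.Epsilon).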

The last step in the process is to create a double byte array with our new field values and use the standard BitConverter to read it into a double.
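
A minimal sketch of this last step might look like the following. The class and method names are hypothetical, and the fields are assumed to be already translated (e in 11 bits, f in 52 bits). It packs the fields into a 64-bit integer and lets BitConverter produce and read the byte array, so the byte order always matches the machine's.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class DoubleAssembler
{
    // s: sign bit, e: 11-bit biased exponent, f: 52-bit fraction.
    public static double FromFields(bool s, int e, ulong f)
    {
        if (e < 0 || e > 0x7FF)
            throw new ArgumentOutOfRangeException(nameof(e));
        if (f > 0xFFFFFFFFFFFFFUL)
            throw new ArgumentOutOfRangeException(nameof(f));

        // Layout of a double: 1 sign bit, 11 exponent bits, 52 fraction bits.
        ulong bits = (s ? 0x8000000000000000UL : 0UL)
                   | ((ulong)e << 52)
                   | f;

        // Build the byte array, then read it back with the standard BitConverter.
        byte[] bytes = BitConverter.GetBytes(bits);
        return BitConverter.ToDouble(bytes, 0);
    }
}
```

For example, FromFields(false, 0x3FF, 0) gives 1.0 and FromFields(true, 0x400, 0) gives -2.0.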

Using the Code

The code is used in the same way that the System.BitConverter class is used. To read a long double byte array, use ToDouble(), specifying a byte array and start index. To generate a byte array from a double, use GetBytes(), specifying the double.

Conclusion

This project proved to be an interesting look into the structure and bitwise manipulation of floating point numbers.

Doubles are surprisingly easy to construct and deconstruct once you understand their internals.

References

History

  • 2004-04-06: Initial release.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Nathan Baulch
Web Developer
Australia
I work for a small software company that builds sales force automation software.

Comments and Discussions

 
Overflow to return +/- Inf (Andrew Phillips, 8-Nov-04 13:12)
One thing that some might find confusing is your use of the term "long double" for 80-bit IEEE 754 floating point numbers. This is actually a C/C++ data type that may be implemented using 80-bit numbers, but could also be implemented using 64, 128, or even 32 bit numbers. Typically, most compilers use 64 bit numbers for both double and long double.
 
Also I don't know anything about 80-bit IEEE 754 numbers but I know a little about 64 and 32 bit ones. What do you mean by "Unsupported" that appears in the table for both 64 and 80 bit numbers? What bit pattern does that have for 64 bit numbers?
 
64 bit numbers also support quiet and signalling NaNs. Why not convert them to the same type?
 
Apart from that it's a great article. One suggestion would be an option to return +/- Infinity rather than throw an exception on overflow. Actually someone might also want an underflow to throw an exception rather than return zero, too.

 
Andrew Phillips
aphillips @ expertcomsoft.com
Last Updated 6 Apr 2004
Article Copyright 2004 by Nathan Baulch