## Overview

Floating-point arithmetic is generally considered a rather occult topic. Floating-point numbers are somewhat fuzzy things whose exact values are clouded in ever growing mystery with every significant digit that is added. This attitude is somewhat surprising given the wide range of everyday applications that don't simply use floating-point arithmetic, but depend on it.

Our aim in this three-part series is to remove some of the mystery surrounding floating-point math, to show why it is important for most programmers to know about it, and to show how you can use it effectively when programming for the .NET platform. In this first part, we will cover some basic concepts of numerical computing: number formats, accuracy and precision, and round-off error. We will also cover the .NET floating-point types in some depth. The second part will list some common numerical pitfalls and show you how to avoid them. In the third and final part, we will see how Microsoft handled the subject in the Common Language Runtime and the .NET Base Class Library.

## Introduction

Here's a quick quiz. What is printed when the following piece of code runs? We calculate one divided by 103 in both single and double precision. We then multiply by 103 again, and compare the result to the value we started out with:

```csharp
Console.WriteLine("((double)(1/103.0))*103 < 1 is {0}.", ((double)(1/103.0))*103 < 1);
Console.WriteLine("((float)(1/103.0F))*103 > 1 is {0}.", ((float)(1/103.0F))*103 > 1);
```

In exact arithmetic, the left-hand sides of the comparisons are equal to 1, and so the answer would be `false` in both cases. In actual fact, `true` is printed twice. Not only that, we get results that don't match what we would expect mathematically. Two alternative ways of performing the exact same calculation give totally contradictory results!

This example is typical of the weird behavior of floating-point arithmetic that has given it a bad reputation. You will encounter this behavior in many situations. Without proper care, your results will be unexpected if not outright undesirable. For example, let's say the price of a widget is set at $4.99. You want to know the cost of 17 widgets. You could go about this as follows:

```csharp
float price = 4.99F;
int quantity = 17;
float total = price * quantity;
Console.WriteLine("The total price is ${0}.", total);
```

You would expect the result to be $84.83, but what you get is $84.82999. If you're not careful, it could cost you money. Say you have a $100 item, and you give a 10% discount. Your prices are all in full dollars, so you use `int` variables to store prices. Here is what you get:

```csharp
int fullPrice = 100;
float discount = 0.1F;
int finalPrice = (int)(fullPrice * (1-discount));
Console.WriteLine("The discounted price is ${0}.", finalPrice);
```

Guess what: the final price is $89, not the expected $90. Your customers will be happy, but you won't. You've given them an extra 1% discount.
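The effect can be isolated in a small sketch. It assumes the intermediate result is held in more than single precision (as the x87 registers of older x86 runtimes do), so the `float` is widened to `double` explicitly to make the outcome independent of the runtime. The class and method names are illustrative only, and rounding or switching to `decimal` are just two possible remedies.

```csharp
using System;

static class DiscountDemo
{
    // Truncating cast, as in the example above, with the float widened to double.
    public static int Truncated(int fullPrice, float discount)
    {
        return (int)(fullPrice * (1 - (double)discount));
    }

    // Round to the nearest whole dollar before converting.
    public static int Rounded(int fullPrice, float discount)
    {
        return (int)Math.Round(fullPrice * (1 - (double)discount));
    }

    // Do the arithmetic in decimal, where 0.1 is represented exactly.
    public static int WithDecimal(int fullPrice, decimal discount)
    {
        return (int)(fullPrice * (1 - discount));
    }

    static void Main()
    {
        // The stored discount is slightly more than 10%.
        Console.WriteLine((double)0.1F > 0.1);
        Console.WriteLine(Truncated(100, 0.1F));     // 89
        Console.WriteLine(Rounded(100, 0.1F));       // 90
        Console.WriteLine(WithDecimal(100, 0.1M));   // 90
    }
}
```

The product is 89.99999985..., so the cast, which truncates towards zero, throws away almost a full dollar.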

There are other variations on the same theme. Mathematical equalities don't seem to hold. Calculations don't seem to conform to what we learnt in grade three. It all looks fuzzy and confusing. You can be assured, however, that underneath it all are solid and exact mathematical computations. The aim of this article is to expose the underlying math, so you can once again go out and multiply, add, and divide with full confidence.

## Some terminology

Before we do anything, we should define the words that are commonly used in numerical computing.

### Number Formats

A computer program is a model of something in the real world. Many things in the real world are represented by numbers. Those numbers need a representation in our computer program. This is where number formats come in.

From a programmer's point of view, a number format is a collection of numbers. 99.9% of the time, the binary or internal representation is not important. It may be important when we represent non-numeric data, as with bit fields, but that doesn't concern us here. What counts is only that it can represent the numbers from the real world objects we are modeling. Some number formats include certain special values to indicate invalid values or values that are outside the range of the number format.

#### Integers

Most numbers are integers, which are easy to represent. Almost any integer you'll encounter will fit into a "32-bit signed integer", which is a number in the range -2,147,483,648 to 2,147,483,647. For some applications, like counting the number of people in the world, you need the next wider format: 64-bit integers. Their range is wide enough to count every 10th of a microsecond over many millennia. (This is how a `DateTime` value is represented internally.)

Many other numbers, like measurements, prices and percentages, are real numbers with digits after the decimal point. There are essentially two ways to represent real numbers: fixed point and floating point.

#### Fixed-point formats

A fixed-point number is formed by multiplying an integer (the *significand*) by some small scale factor, most often a negative power of 10 or 2. The name derives from the fact that the decimal point is in a fixed position when the number is written out. Examples of fixed-point formats are the `Currency` type in pre-.NET Visual Basic and the `Money` type in SQL Server. These types have a range of +/-900 trillion with four digits after the decimal point. The multiplier is 0.0001 and every multiple of 0.0001 within the defined range is represented by this number format. Another example is found in the Network Time Protocol (NTP), where time offsets are returned as 32 and 64 bit fixed-point values with the 'binary' point at 16 and 32 bits, respectively.

Fixed point works well for many applications. For financial calculations, it has the added benefit that numbers such as 0.1 and 0.01 can be represented exactly with a suitable choice of multiplier. However, it is not suited for many other applications where a greater range is needed. Particle physicists commonly use numbers smaller than 10^{-20}, while cosmologists estimate the number of particles in the observable universe at around 10^{85}. It would be impractical to represent numbers in this range in fixed point format. To cover the whole range, a single number would take up at least 50 bytes!
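As a rough illustration of the idea, here is a minimal sketch of a fixed-point type with a multiplier of 0.0001. The type `Fixed4` and its members are hypothetical and are not how `Currency` or `Money` are actually implemented; the sketch only handles non-negative inputs in `FromParts`.

```csharp
using System;

// Hypothetical fixed-point type with four decimal places.
struct Fixed4
{
    const long Scale = 10000;     // the multiplier is 1/Scale = 0.0001
    public readonly long Raw;     // the integer significand

    public Fixed4(long raw) { Raw = raw; }

    // Builds a value from a whole part and a count of ten-thousandths (non-negative only).
    public static Fixed4 FromParts(long whole, int tenThousandths)
    {
        return new Fixed4(whole * Scale + tenThousandths);
    }

    // Addition is plain integer addition, so it is exact.
    public static Fixed4 operator +(Fixed4 a, Fixed4 b)
    {
        return new Fixed4(a.Raw + b.Raw);
    }

    public override string ToString()
    {
        long abs = Math.Abs(Raw);
        return (Raw < 0 ? "-" : "") + (abs / Scale) + "." + (abs % Scale).ToString("D4");
    }
}

static class FixedPointDemo
{
    static void Main()
    {
        Fixed4 tenth = Fixed4.FromParts(0, 1000);   // 0.1000, stored as the integer 1000
        Fixed4 sum = tenth + tenth + tenth;
        Console.WriteLine(sum);                     // 0.3000
        Console.WriteLine(sum.Raw == 3000);         // True
    }
}
```

The price of this exactness is the range: a 64-bit `Raw` value covers roughly +/-900 trillion at this scale, which matches the figure quoted above.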

#### Floating-point formats

This problem is solved with a floating-point format. Floating-point numbers have a variable scale factor, which is specified as the exponent of a power of a small number called the *base*, which is usually 2 or 10. The .NET framework defines three floating-point types: `Single`, `Double` and `Decimal`. That's right: the `Decimal` type does not use the fixed-point format of the `Currency` or `Money` type. It uses a decimal floating-point format.

A floating-point number has three parts: a sign, a significand and an exponent. The magnitude of the number equals the significand times the base raised to the exponent. Actual storage formats vary. By reserving certain values of the exponent, it is possible to define special values such as infinity and invalid results. Integer and fixed point formats usually do not contain any special values.

Before we go into the details of real life formats, we need to define some more terms.

### Range, Precision and Accuracy

The range of a number format is the interval from the smallest number in the format to the largest. The range of 16-bit signed integers is -32768 to 32767. The range of double-precision floating-point numbers is (roughly) -1.8e+308 to 1.8e+308. Numbers outside a format's range cannot be represented directly. Numbers within the range may not exist in the number format - infinitely many don't. But at least there is always a number in the format that is fairly close to our number.

Accuracy and precision are terms that are often confused, even though they have significantly different meanings.

*Precision* is a property of a number format and refers to the amount of information used to represent a number. Better or higher precision means more numbers can be represented, and also means a better resolution: the numbers that are represented by a higher precision format are closer together. 1.3333 is a number represented with a precision of five decimal digits: one before and four after the decimal point. 1.333300 is the *same* number represented with 7-digit precision.

Precision can be absolute or relative. Integer types have an absolute precision of 1. Every integer within the type's range is represented. Fixed-point types, like the `Currency` type in earlier versions of Visual Basic, also have an absolute precision. For the `Currency` type, it is 0.0001, which means that every multiple of 0.0001 within the type's range is represented.

Floating-point formats use relative precision. This means that the precision is constant relative to the size of the number. For example, 1.3331, 1.3331e+5 = 133310, and 1.3331e-3 = 0.0013331 all have 5 decimal digits of relative precision.
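One way to see relative precision at work is to measure the gap between a `double` and its next larger neighbor. The sketch below does this by incrementing the bit pattern; the helper `GapAbove` is a made-up name and is only meant for positive finite values.

```csharp
using System;

static class SpacingDemo
{
    // Distance from x to the next larger double, found by incrementing the bit pattern.
    // Valid for positive finite x only.
    public static double GapAbove(double x)
    {
        long bits = BitConverter.DoubleToInt64Bits(x);
        return BitConverter.Int64BitsToDouble(bits + 1) - x;
    }

    static void Main()
    {
        // The absolute gap grows and shrinks with the size of the number...
        Console.WriteLine(GapAbove(1.0e-3));
        Console.WriteLine(GapAbove(1.0));
        Console.WriteLine(GapAbove(1.0e5));

        // ...but the gap relative to the number stays within a factor of two of 2^-52.
        Console.WriteLine(GapAbove(1.0e-3) / 1.0e-3);
        Console.WriteLine(GapAbove(1.0) / 1.0);
        Console.WriteLine(GapAbove(1.0e5) / 1.0e5);
    }
}
```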

Precision is also a property of a calculation. Here, it refers to the number of digits used in the calculation, and in particular also the precision used for intermediate results. As an example, we calculate a simple expression with one and two digit precision:

Using one-digit precision:

| Expression | Step |
|---|---|
| 0.4 * 0.6 + 0.6 * 0.4 = 0.24 + 0.24 | Calculate products |
| = 0.2 + 0.2 | Round to 1 digit |
| = 0.4 | Final result |

Using two-digit precision:

| Expression | Step |
|---|---|
| 0.4 * 0.6 + 0.6 * 0.4 = 0.24 + 0.24 | Calculate products |
| = 0.24 + 0.24 | Keep the 2 digits |
| = 0.48 | Calculate sum |
| = 0.5 | Round to 1 digit |

Comparing to the exact result (0.48), we see that using 1 digit precision gives a result that is off by 0.08, while using two digit precision gives a result that is off by only 0.02. One lesson learnt from this example is that it is useful to use extra precision for intermediate calculations if that option is available.
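The same calculation can be replayed in code. This sketch uses `decimal` and `Math.Round` so that the rounding happens in base 10, exactly as in the worked example; the method name `Evaluate` and its `digits` parameter (the number of decimal places kept for the intermediate products) are assumptions made for the illustration.

```csharp
using System;

static class IntermediatePrecisionDemo
{
    // Evaluates a*b + c*d, rounding each product to 'digits' decimal places
    // and the final sum to one decimal place.
    public static decimal Evaluate(decimal a, decimal b, decimal c, decimal d, int digits)
    {
        decimal p1 = Math.Round(a * b, digits);
        decimal p2 = Math.Round(c * d, digits);
        return Math.Round(p1 + p2, 1);
    }

    static void Main()
    {
        Console.WriteLine(Evaluate(0.4M, 0.6M, 0.6M, 0.4M, 1));   // products kept to 1 digit
        Console.WriteLine(Evaluate(0.4M, 0.6M, 0.6M, 0.4M, 2));   // products kept to 2 digits
    }
}
```

The first call reproduces the 0.4 of the one-digit calculation, the second the 0.5 of the two-digit calculation.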

*Accuracy* is a property of a number in a specific context. It indicates how close a number is to its true value in that context. Without the context, accuracy is meaningless, in much the same way that "John is 25 years old" has no meaning if you don't know which John you are talking about.

Accuracy is closely related to error. Absolute error is the difference between the value you obtained and the actual value for some quantity. Relative error roughly equals the absolute error divided by the actual value, and is usually expressed in the number of significant digits. Higher accuracy means smaller error.

Accuracy and precision are related, but only indirectly. A number stored with very low precision can be exactly accurate. For example:

```csharp
Byte n0 = 0x03;
Int16 n1 = 0x0003;
Int32 n2 = 0x00000003;
Single n3 = 3.000000f;
Double n4 = 3.000000000000000;
```

Each of these five variables represents the number 3 *exactly*. The variables are stored with different precisions, using from 8 to 64 bits. For the sake of clarity, the precision of the numbers is shown explicitly, but the precision does not have any impact on the accuracy.

Now look at the same number 3 as an approximation for pi, the ratio of the circumference of a circle to its diameter. 3 is only accurate to one significant digit, no matter what the precision. The `Double` value uses 8 times as much storage as the `Byte` value, but it is no more accurate.
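To put numbers on this, the following sketch computes the absolute and relative error of 3 as an approximation of pi. The helper names are made up for the example, and `Math.PI` is itself only the closest `double` to pi, which is more than accurate enough here.

```csharp
using System;

static class ErrorDemo
{
    public static double AbsoluteError(double approx, double exact)
    {
        return Math.Abs(approx - exact);
    }

    public static double RelativeError(double approx, double exact)
    {
        return Math.Abs(approx - exact) / Math.Abs(exact);
    }

    static void Main()
    {
        byte asByte = 3;
        double asDouble = 3.0;

        // Same value, same error, whatever the storage size.
        Console.WriteLine(AbsoluteError(asByte, Math.PI));
        Console.WriteLine(AbsoluteError(asDouble, Math.PI));

        // About 4.5%, so only the first significant digit can be trusted.
        Console.WriteLine(RelativeError(asDouble, Math.PI));
    }
}
```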

### Round-off error

Let's say you have a non-integer number from the real world that you want to use in your program. Most likely you are faced with a problem. Unless your number has some special form, it cannot be represented by any of the number formats that are available to you. Your only solution is to find the number that is represented by a number format that is closest to your number. Throughout the lifetime of the program, you will use this approximation to your 'real' number in calculations. Instead of using the exact value *a*, the program will use a value *a*+*e*, with *e* a very small number which can be positive or negative. This number *e* is called the round-off error.

It's bad enough that you are forced to use an approximation of your number. But it gets worse. In almost every arithmetic operation in your program, the result of that operation will once again not be represented in the number format. On top of the initial round-off error, almost every arithmetic operation introduces a further error *e*_{i}. For example, adding two numbers, *a* and *b*, results in the number (*a* + *b*) + (*e*_{a} + *e*_{b} + *e*_{sum}), where *e*_{a}, *e*_{b}, and *e*_{sum} are the round-off errors of *a*, *b*, and the result, respectively. Round-off error *propagates* and is very often amplified by calculations. Fortunately, the round-off errors tend to cancel each other out to some degree, but rarely do they cancel out completely. Some calculations may also be affected more than others.
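A classic illustration, sketched here under the assumption of ordinary IEEE double-precision arithmetic: none of 0.1, 0.2 and 0.3 can be represented exactly in binary, and the round-off errors of the operands and of the sum do not cancel. The class and method names are illustrative.

```csharp
using System;

static class RoundOffDemo
{
    public static double Sum(double a, double b)
    {
        return a + b;
    }

    static void Main()
    {
        double c = 0.3;
        double sum = Sum(0.1, 0.2);

        Console.WriteLine(sum == c);                    // False
        Console.WriteLine(Math.Abs(sum - c) < 1e-15);   // True: the error is tiny, but not zero
        Console.WriteLine((sum - c).ToString("R"));     // the accumulated round-off error
    }
}
```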

Part two of this series will have a lot more to say about round-off error and how to minimize its adverse effects.

## Standards for Floating-Point Arithmetic

Back in the '60s and '70s, computers were still very new and expensive. Every manufacturer had its own processor technology, with its own numerical formats, its own rules for handling overflows and divide-by-zeros, its own rounding rules. The situation can best be described as total anarchy.

Fortunately, the anarchy ended soon after the introduction of the personal computer. By the mid '80s, a standard had emerged that would bring some order: the IEEE-754 Standard for Binary Floating-Point Arithmetic. Intel had already based the design of its 8087 numerical co-processor on a draft of the standard. Some remnants of the old anarchy remained for some time. Microsoft continued to use its own 'Microsoft Binary Format' for floating-point numbers in its BASIC interpreters up to and including QuickBasic 3.0.

The IEEE-754 standard later became known as "IEC 60559:1989, Binary floating-point arithmetic for microprocessors". This is the official reference standard used by Microsoft in all its current specifications.

The IEC 60559 standard does more than define number formats. It sets guidelines for many aspects of floating-point arithmetic. Many of these guidelines are mandatory. Some are optional. We will see in part three of this series that Microsoft's implementation of the standard in the Common Language Runtime is incomplete. For now, we will focus on the two number formats supported by the CLR: single and double precision floating-point numbers. We will also touch on the 'extended' format, for which the IEC 60559 standard defines minimum specifications, and which is used by the floating-point unit on Intel processors, and is also used internally by the CLR.

### Anatomy of the single-precision floating-point format

We will now look at the details of the single and double precision formats. Although you can get by without knowing the internals, this information can help to understand some of the nuances of working with floating-point numbers. In order to keep the discussion clear, we will at first only consider the single precision format. The double precision format is similar, and will be summarized later.

#### Normalized numbers

A typical binary floating-point number has the form *s* × (*m* / 2^{N-1}) × 2^{e}, where *s* is either -1 or +1, *m* and *e* are the mantissa or significand and exponent mentioned earlier, and *N* is the number of bits in the significand, which is a constant for a specific number format. For single-precision numbers, *N* = 24. The numbers *s*, *m* and *e* are packed into 32 bits. The layout is shown in the table below:

| Part | Sign | Exponent | Fraction |
|---|---|---|---|
| Bit # | 31 | 23-30 | 0-22 |

The sign *s* is stored in the most significant bit. A value of 0 indicates a positive value, while 1 indicates a negative value.

The exponent field is an 8 bit unsigned integer called the *biased exponent*. It is equal to the exponent *e* plus a constant called the *bias* which has a value of 127 for single-precision numbers. This means that, for example, an exponent of -44 is stored as -44+127= 83 or 01010011. There are two reserved exponent values: 0 and 255. The reason for this will be explained shortly. As a result, the smallest actual exponent is -126, and the largest is +127.

The number format appears to be ambiguous: You can multiply *m* by 2 and subtract 1 from *e* and get the same number. This ambiguity is resolved by minimizing the exponent and maximizing the size of the significand. This process is called *normalization*. As a result, the significand *m* always has 24 bits, with the leading bit always equal to 1. Since we know it is always equal to 1, we don't have to store this bit, and so we end up with the significand taking up only 23 bits instead of 24.

Put another way, normalization means that the number *m* / 2^{N-1} always lies between 1 and 2. The 23 stored bits are also what comes after the binary point when the significand is divided by 2^{N-1}. For this reason, these bits are sometimes called the *fraction*.
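The three fields can be pulled out of a `float` with a few shifts and masks. This is a minimal sketch: the class `SingleBits` and its members are invented for the illustration, and the bits are obtained by a round trip through `BitConverter.GetBytes` and `BitConverter.ToInt32`, which is one of several possible ways to reinterpret them.

```csharp
using System;

static class SingleBits
{
    // Reinterpret the 32 bits of a float as an int.
    public static int ToBits(float value)
    {
        return BitConverter.ToInt32(BitConverter.GetBytes(value), 0);
    }

    // Bit 31.
    public static int Sign(float value) { return (ToBits(value) >> 31) & 1; }

    // Bits 23-30: the exponent plus the bias of 127.
    public static int BiasedExponent(float value) { return (ToBits(value) >> 23) & 0xFF; }

    // Bits 0-22: the significand without its implicit leading 1.
    public static int Fraction(float value) { return ToBits(value) & 0x7FFFFF; }

    static void Describe(float value)
    {
        Console.WriteLine("{0}: sign={1} exponent={2} fraction={3}",
            value,
            Sign(value),
            Convert.ToString(BiasedExponent(value), 2).PadLeft(8, '0'),
            Convert.ToString(Fraction(value), 2).PadLeft(23, '0'));
    }

    static void Main()
    {
        Describe(1.0F);       // exponent 0, stored as 127
        Describe(-2.0F);      // exponent 1, stored as 128, sign bit set
        Describe(0.15625F);   // 1.25 x 2^-3: stored exponent 124, fraction 0100...0
    }
}
```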

#### Zero and subnormal numbers

At this point, you may wonder how the number zero is stored. After all, neither *m* nor *s* can be zero, and so their product cannot be zero either. The answer is that 0 is a special number with a special representation. In fact, it has two representations!

The numbers we have been describing so far, whose significand has maximum length, are called *normalized* numbers. They represent the vast majority of numbers represented by the floating-point format. The smallest positive normalized value is 2^{23} × 2^{-126+1-24} = 2^{-126}, or about 1.1754e-38. The largest value is (2^{24}-1) × 2^{127+1-24}, or about 3.4028e+38.

Recall that the biased exponent has two reserved values. The biased exponent 0 is used to represent the number zero as well as *subnormal* or *denormalized* numbers. These are numbers whose significand is not normalized and has a maximum length of 23 bits. The actual exponent used is -126+1-24 = -149, resulting in a smallest positive number of 2^{-149} = 1.4012e-45.

When both the biased exponent and the significand are zero, the resulting value is equal to 0. Changing the sign of zero does not change its value, so we have two possible representations of zero: one with a positive sign, and one with a negative sign. As it turns out, it *is* meaningful to have a 'negative zero' value. Although its value equals the value of normal 'positive zero', it behaves differently in some situations, which we will get into shortly.

#### Infinities and Not-a-Number

We still need to explain the use of the other reserved biased exponent value of 255. This exponent is used to represent infinities and Not-a-Number values.

If the biased exponent is all 1's (i.e. equal to 255) and the significand is all 0's, then the number represents infinity. The sign bit indicates whether we're dealing with positive or negative infinity. These numbers are returned for operations that either do not have a finite value (e.g. 1/0) or are too large to be represented by a normalized number (e.g. 2^{1,000,000,000}).

The sign of the result of a division by zero depends on the sign of both the numerator and the denominator. If you divide +1 by negative zero, the result is negative infinity. Conversely, if you divide -1 by positive infinity, the result is negative zero.
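These rules are easy to check. In the sketch below, negative zero is built directly from its bit pattern (only the sign bit set) so that the example does not rely on how a compiler treats the literal `-0.0`. The names `NegativeZero` and `IsNegativeZero` are illustrative.

```csharp
using System;

static class SignedZeroDemo
{
    // Negative zero: only the sign bit is set.
    public static readonly double NegativeZero = BitConverter.Int64BitsToDouble(long.MinValue);

    public static bool IsNegativeZero(double x)
    {
        return BitConverter.DoubleToInt64Bits(x) == long.MinValue;
    }

    static void Main()
    {
        double zero = 0.0, one = 1.0;

        Console.WriteLine(NegativeZero == zero);                             // True: the two zeros compare equal
        Console.WriteLine(one / zero == double.PositiveInfinity);            // True
        Console.WriteLine(one / NegativeZero == double.NegativeInfinity);    // True
        Console.WriteLine(IsNegativeZero(-one / double.PositiveInfinity));   // True
    }
}
```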

If the significand is different from 0, the value represents a Not-a-Number value or NaN. NaNs come in two flavors: signaling and non-signaling or quiet. On Intel processors, the leading bit of the fraction is 0 for a signaling NaN and 1 for a quiet NaN. This distinction is not very important in practice, and is likely to be dropped in the next revision of the standard.

NaNs are produced when the result of a calculation does not exist (e.g. `Math.Sqrt(-1)` is not a real number) or cannot be determined (infinity / infinity). One of the peculiarities of NaNs is that all arithmetic operations involving NaNs return a NaN, except when the result would be the same regardless of the value. For example, the function `hypot(x, y) = Math.Sqrt(x*x+y*y)` with `x` infinite always equals positive infinity, regardless of the value of `y`. As a result, `hypot(infinity, NaN)` = infinity.

Also, any comparison of a NaN with any other number including NaN returns false. The one exception is the inequality operator, which always returns true even if the value being compared is also NaN!
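The following sketch shows how NaNs arise and how they behave in comparisons. The helper `EqualsItself` exists only to avoid comparing a variable with itself directly, which the C# compiler warns about.

```csharp
using System;

static class NaNDemo
{
    // Compares a value with a copy of itself using the == operator.
    public static bool EqualsItself(double x)
    {
        double copy = x;
        return x == copy;
    }

    static void Main()
    {
        double nan = Math.Sqrt(-1.0);
        double inf = double.PositiveInfinity;

        Console.WriteLine(double.IsNaN(nan));          // True
        Console.WriteLine(double.IsNaN(inf / inf));    // True
        Console.WriteLine(double.IsNaN(nan + 5.3));    // True: NaN propagates

        double other = nan;
        Console.WriteLine(EqualsItself(nan));          // False
        Console.WriteLine(nan < 1.0 || nan >= 1.0);    // False: every ordered comparison fails
        Console.WriteLine(nan != other);               // True
    }
}
```

Because `==` is never true for a NaN, `double.IsNaN` is the reliable way to test for one.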

The significand bits of a NaN can be set to an arbitrary value, sometimes called the *payload*. The IEC 60559 standard specifies that the payload should propagate through calculations. For example, when a NaN is added to a normal number, say 5.3, then the result is a NaN with the same payload as the first operand. When both operands are NaNs, then the resulting NaN carries the payload of either one of the operands. This leaves the possibility to pass on potentially useful information in NaN values. Unfortunately, this feature is hardly ever used.

#### Some examples

Let's look at some numbers and their corresponding bit patterns.

| Number | Sign | Exponent | Fraction |
|---|---|---|---|
| 0 | 0 | 00000000 | 00000000000000000000000 |
| -0 | 1 | 00000000 | 00000000000000000000000 |
| 1 | 0 | 01111111 | 00000000000000000000000 |
| +Infinity | 0 | 11111111 | 00000000000000000000000 |
| NaN | 1 | 11111111 | 10000000000000000000000 |
| 3.141593 | 0 | 10000000 | 10010010000111111011100 |
| -3.141593 | 1 | 10000000 | 10010010000111111011100 |
| 100000 | 0 | 10001111 | 10000110101000000000000 |
| 0.00001 | 0 | 01101110 | 01001111100010110101100 |
| 1/3 | 0 | 01111101 | 01010101010101010101011 |
| 4/3 | 0 | 01111111 | 01010101010101010101011 |
| 2^{-144} | 0 | 00000000 | 00000000000000000100000 |

Notice the exponent field for 1 and 4/3. Both these numbers are between 1 and 2, and so their unbiased exponent is zero. The biased exponent is therefore equal to the bias, which is 127, or 01111111 in binary. Numbers equal to or larger than 2 have biased exponents greater than 127. Numbers smaller than 1 have biased exponents smaller than 127.

The last number in the table (2^{-144}) is denormalized. The biased exponent is zero, and since 2^{-144} = 32 × 2^{-149}, the fraction is 32 = 2^{5}.

### Double and extended precision formats

Double-precision floating-point numbers are stored in a way that is completely analogous to the single-precision format. Some of the constants are different. The sign still takes up 1 bit - no surprise there. The biased exponent takes up 11 bits, with a bias value of 1023. The significand takes up 52 bits with the 53rd bit implicitly set to 1 for normalized numbers.

The IEC 60559 standard doesn't specify exact values for the parameters of the extended floating-point format. It only specifies minimum values. The extended format used by Intel processors since the 8087 is 80 bits long, with 15 bits for the exponent and 64 bits for the significand. Unlike other formats, the extended format does leave room for the leading bit of the significand, enabling certain processor optimizations and saving some precious real estate on the chips.

The following table summarizes the features of the single, double and extended precision formats.

| Format | Single | Double | Extended |
|---|---|---|---|
| Length (bits) | 32 | 64 | 80 |
| Exponent bits | 8 | 11 | 15 |
| Exponent bias | 127 | 1023 | 16383 |
| Smallest exponent | -126 | -1022 | -16382 |
| Largest exponent | +127 | +1023 | +16383 |
| Precision (bits) | 24 | 53 | 64 |
| Smallest positive value | 1.4012985e-45 | 4.9406564584124654e-324 | 3.64519953188247460253e-4951 |
| Smallest positive normalized value | 1.1754944e-38 | 2.2250738585072014e-308 | 3.36210314311209350626e-4932 |
| Largest positive value | 3.4028235e+38 | 1.7976931348623157e+308 | 1.18973149535723176502e+4932 |

#### What about the decimal format?

The `Decimal` type in the .NET framework is a non-standard floating-point type with base 10. It takes up 128 bits. 96 of those are used for the mantissa. 1 bit is used for the sign, and 5 bits are used for the exponent, which can range from 0 to 28. The format does not follow any existing or planned standard. There are no infinities or NaNs.

Any decimal number with no more than 28 significant digits, counting the digits before and after the decimal point together, can be represented exactly. This is great for financial calculations, but comes at a significant cost. Calculating with decimals is an order of magnitude slower than with the intrinsic floating-point types. Decimals also take up at least twice as much memory.
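A short sketch of what this buys you. It relies on `decimal.GetBits`, which returns the 96-bit integer in its first three elements and keeps the sign and the scale (the power of 10 to divide by) in the fourth; the helper name `Scale` is made up for the example.

```csharp
using System;

static class DecimalDemo
{
    // The scale lives in bits 16-23 of the fourth element returned by decimal.GetBits.
    public static int Scale(decimal value)
    {
        return (decimal.GetBits(value)[3] >> 16) & 0xFF;
    }

    static void Main()
    {
        // All three values are exact in base 10, so the comparison holds.
        Console.WriteLine(0.1M + 0.2M == 0.3M);         // True

        // In binary double precision the same comparison fails.
        double a = 0.1, b = 0.2;
        Console.WriteLine(a + b == 0.3);                // False

        // 1.23 is stored as the integer 123 with a scale of 2.
        Console.WriteLine(decimal.GetBits(1.23M)[0]);   // 123
        Console.WriteLine(Scale(1.23M));                // 2
    }
}
```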

### Other parts of the standard

In addition to the number formats, the IEC 60559 standard also precisely defines the behavior of the basic arithmetic operations +, -, *, /, and square root.

It also specifies the details of rounding. There are four possible ways to round a number, called *rounding modes* in floating-point jargon:

- towards the nearest number (round up or down, whichever produces the smaller error)
- towards zero (round down for positive numbers, and up for negative numbers)
- towards +infinity (always round up)
- towards -infinity (always round down)

In general, the first option will lead to smaller round-off error, which is why it is the default in most compilers. However, it is also the least predictable. The other rounding modes have more predictable properties. In some cases, it is easier to compensate for round-off error using these modes.
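The four directions are easiest to picture at the level of whole numbers. The sketch below is only an analogy: the `Math` methods round to integers, whereas the rounding modes of the standard apply to the last bit of the significand of every result. It does show the tie-breaking rule of rounding to nearest, which picks the even neighbor.

```csharp
using System;

static class RoundingModesDemo
{
    static void Show(double x)
    {
        Console.WriteLine("{0}: nearest={1} towards zero={2} up={3} down={4}",
            x, Math.Round(x), Math.Truncate(x), Math.Ceiling(x), Math.Floor(x));
    }

    static void Main()
    {
        Show(2.7);
        Show(-2.7);
        Show(2.5);   // a tie: Math.Round picks the even neighbor, 2
        Show(3.5);   // a tie: the even neighbor is 4
    }
}
```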

Exceptions are another, though underused, feature. Exceptions signal that something unusual has happened during a calculation. Exceptions are not fatal errors. A flag is set and a default value is returned. There are five exceptions in all:

| Exception | Situation | Return value |
|---|---|---|
| Invalid operation | An operand is invalid for the operation to be performed. | NaN |
| Division by zero | An attempt is made to divide a non-zero value by zero. | Infinity (1/-0 = negative infinity) |
| Overflow | The result of an operation is too large to be represented by the floating-point format. | Positive or negative infinity. |
| Underflow | The result of an operation is too small to be represented by the floating-point format. | Positive or negative zero. |
| Inexact | The rounded result of an operation is not exact. | The calculated value. |

The return value in case of overflow and underflow actually depends on the rounding mode. The values given are those for rounding to nearest, which is the default.

Floating-point exceptions act similarly to integer overflows in the CLR. By default, no action is taken in case of overflow. However, in a checked context, an exception is thrown when integer overflow occurs. Similarly, the IEEE-754/IEC 60559 standard defines a trap mechanism that passes control over to a trap handler when an exception occurs.
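The default return values from the table can be observed directly in C#, where floating-point operations never throw. The sketch assumes the default rounding mode (round to nearest); the helper `IntOverflowThrows` is a made-up name that shows the contrasting behavior of integer arithmetic in a checked context.

```csharp
using System;

static class ExceptionsDemo
{
    // Returns true if incrementing int.MaxValue in a checked context throws.
    public static bool IntOverflowThrows()
    {
        int big = int.MaxValue;
        try
        {
            checked { big = big + 1; }
            return false;
        }
        catch (OverflowException)
        {
            return true;
        }
    }

    static void Main()
    {
        double zero = 0.0, huge = double.MaxValue, tiny = double.Epsilon;

        Console.WriteLine(double.IsNaN(zero / zero));               // invalid operation -> NaN
        Console.WriteLine(1.0 / zero == double.PositiveInfinity);   // division by zero -> infinity
        Console.WriteLine(huge * 2.0 == double.PositiveInfinity);   // overflow -> infinity
        Console.WriteLine(tiny / 2.0 == 0.0);                       // underflow -> zero
        Console.WriteLine(IntOverflowThrows());                     // checked integer overflow throws
    }
}
```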

## Finally, some real code

Nearly all discussions up to this point have been theoretical. It's time to do some coding!

We will stick with the IEEE-754 standard and implement some of the 'recommended functions' listed in the annex to the standard for double-precision numbers. These are:

| Function | Description |
|---|---|
| `CopySign(x, y)` | Copies the sign of `y` to `x`. |
| `Scalb(y, n)` | Computes *y* × 2^{n} for integer `n` without computing 2^{n}. |
| `Logb(x)` | Returns the unbiased exponent of `x`. |
| `NextAfter(x, y)` | Returns the next representable neighbor of `x` in the direction of `y`. |
| `Finite(x)` | Returns true if `x` is a finite real number. |
| `Unordered(x, y)` | Returns true if `x` and `y` are unordered, i.e. if either `x` or `y` is a NaN. |
| `Class(x)` | Returns the floating-point class of the number `x`. |

We won't go into much detail here. The code is mostly self-explanatory. There are a few points of interest.

### Converting to and from binary representations

Most of these functions perform some sort of operation on the binary representation of a floating-point number. A single value has the same number of bits as an `int`, and a double value has the same number of bits as a `long`.

For double values, the `BitConverter` class contains two useful methods: `DoubleToInt64Bits` and `Int64BitsToDouble`. As the names suggest, these methods convert a `double` to and from a 64-bit integer. There is no equivalent for Single values. Fortunately, one line of unsafe code will do the trick.
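To give an idea of what such functions look like, here is a minimal sketch of four of them for `double`, built on `DoubleToInt64Bits` and `Int64BitsToDouble`. It is not the code that accompanies the article: the class name and the mask constants are assumptions, and `Logb` is only correct for normalized numbers.

```csharp
using System;

static class RecommendedFunctions
{
    const long SignMask = long.MinValue;             // bit 63
    const long ExponentMask = 0x7FF0000000000000L;   // bits 52-62

    // Returns x with the sign of y.
    public static double CopySign(double x, double y)
    {
        long xBits = BitConverter.DoubleToInt64Bits(x);
        long yBits = BitConverter.DoubleToInt64Bits(y);
        return BitConverter.Int64BitsToDouble((xBits & ~SignMask) | (yBits & SignMask));
    }

    // True unless the biased exponent is all ones (infinity or NaN).
    public static bool Finite(double x)
    {
        return (BitConverter.DoubleToInt64Bits(x) & ExponentMask) != ExponentMask;
    }

    // Unbiased exponent; this sketch handles normalized numbers only.
    public static int Logb(double x)
    {
        long biased = (BitConverter.DoubleToInt64Bits(x) & ExponentMask) >> 52;
        return (int)biased - 1023;
    }

    // True if either argument is a NaN.
    public static bool Unordered(double x, double y)
    {
        return double.IsNaN(x) || double.IsNaN(y);
    }

    static void Main()
    {
        Console.WriteLine(CopySign(3.0, -1.0));               // -3
        Console.WriteLine(Finite(double.PositiveInfinity));   // False
        Console.WriteLine(Logb(8.0));                         // 3
        Console.WriteLine(Logb(0.75));                        // -1
        Console.WriteLine(Unordered(1.0, double.NaN));        // True
    }
}
```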

### Finding the next representable neighbor

Finding the next representable neighbor of a floating-point number, which is the purpose of the `NextAfter` method, appears to be a rather complicated operation. We have to deal with positive and negative, normalized and denormalized numbers, exponents and significands, as well as zeros, infinities, and NaNs!

Fortunately, a special property of the floating-point formats comes to the rescue: the values are ordered like sign-magnitude integers. What this means is that, setting aside the sign bit for the moment, the order of the floating-point numbers and their binary representation is the same. So, all we have to do to find the next neighbor is to increment or decrement the binary representation. There are a few special cases, but all the handling of exponents is taken care of.
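The idea translates into very little code. The following is one possible sketch of `NextAfter` for `double`, not necessarily the implementation that ships with the article; the way the special cases are handled (NaN in, NaN out, and stepping off zero to the smallest subnormal) is an assumption.

```csharp
using System;

static class Neighbors
{
    public static double NextAfter(double x, double y)
    {
        if (double.IsNaN(x) || double.IsNaN(y)) return double.NaN;
        if (x == y) return y;

        // Stepping off either zero: the neighbor is the smallest subnormal with the sign of y.
        if (x == 0.0) return y > 0 ? double.Epsilon : -double.Epsilon;

        long bits = BitConverter.DoubleToInt64Bits(x);

        // The magnitude bits are ordered like integers, so moving away from zero
        // increments the bit pattern and moving towards zero decrements it.
        bool awayFromZero = (y > x) == (x > 0);
        bits += awayFromZero ? 1 : -1;

        return BitConverter.Int64BitsToDouble(bits);
    }

    static void Main()
    {
        Console.WriteLine(NextAfter(1.0, 2.0) > 1.0);                                             // True
        Console.WriteLine(NextAfter(0.0, 1.0) == double.Epsilon);                                 // True
        Console.WriteLine(NextAfter(double.MaxValue, double.PositiveInfinity)
                          == double.PositiveInfinity);                                            // True
        Console.WriteLine(NextAfter(-1.0, 0.0) > -1.0);                                           // True
    }
}
```

Note how the step from the largest finite number to infinity, and from the smallest normalized number into the subnormals, needs no special handling: the carry into the exponent field does the work.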

## Conclusion

In this first article in a three-part series, we introduced the basic concepts of numerical computing: number formats, accuracy, precision, range, and round-off error. We described the most common number formats (single, double and extended precision) and the standard that defines them. Finally, we wrote some code to implement some floating-point functions.

The next part in this series will be of much more direct practical value. We will look more in depth at the dangers that come with doing calculations with floating-point numbers, and we'll show you how you can avoid them.
