Posted 24 Sep 2008

Five Tips for Floating Point Programming

John D. Cook, 26 Jun 2014
This article covers five of the most important things someone needs to know when working with floating point numbers.


There are several traps that even very experienced programmers fall into when they write code that depends on floating point arithmetic. This article explains five things to keep in mind when working with floating point numbers, i.e. float and double data types.

Don't Test for Equality

You almost never want to write code like the following:

double x;
double y;
if (x == y) {...}

Most floating point operations involve at least a tiny loss of precision, so two numbers that are equal for all practical purposes may not be exactly equal down to the last bit, and the equality test is likely to fail. For example, the following code snippet prints -1.778636e-015. Although in theory squaring should undo a square root, the round-trip operation is slightly inaccurate.

double x = 10;
double y = sqrt(x);
y *= y; // in theory this recovers x exactly
if (x == y)
    cout << "Square root is exact\n";
else
    cout << x-y << "\n";

In most cases, the equality test above should be written as something like the following:

double tolerance = ...
if (fabs(x - y) < tolerance) {...}

Here tolerance is some threshold that defines what is "close enough" for equality. This raises the question of how close is close enough. This cannot be answered in the abstract; you have to know something about your particular problem to know how close is close enough in your context.
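
As a rough illustration of the tolerance test, here is a minimal sketch that combines an absolute and a relative threshold. The function name nearlyEqual and both tolerance parameters are assumptions made for this example, not something the article prescribes, and suitable values still depend on your problem.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper, not from the article: returns true when a and b differ
// by no more than absTol, or by no more than relTol times the larger magnitude.
bool nearlyEqual(double a, double b, double absTol, double relTol)
{
    double diff = std::fabs(a - b);
    if (diff <= absTol)
        return true;   // handles values close to zero
    double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= relTol * scale;
}
```

With this sketch, nearlyEqual(x, y, 1e-12, 1e-12) would accept the square root round trip shown earlier. If either argument is a NaN, every comparison is false and the function returns false.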

Worry about Addition and Subtraction more than Multiplication and Division

The relative errors in multiplication and division are always small, provided the result neither overflows nor underflows. Addition and subtraction, on the other hand, can result in complete loss of precision. Really the problem is subtraction; addition can only be a problem when the two numbers being added have opposite signs, so you can think of that as subtraction. Still, code might be written with a "+" that is really subtraction.

Subtraction is a problem when the two numbers being subtracted are nearly equal. The more nearly equal the numbers, the greater the potential for loss of precision. Specifically, if two numbers agree to n bits, n bits of precision may be lost in the subtraction. This may be easiest to see in the extreme: if two numbers are not equal in theory but are equal in their machine representation, their difference will be calculated as zero, a 100% loss of precision.

Here's an example where such loss of precision comes up often. The derivative of a function f at a point x is defined to be the limit of (f(x+h) - f(x))/h as h goes to zero. So a natural approach to computing the derivative of a function would be to evaluate (f(x+h) - f(x))/h for some small h. In theory, the smaller h is, the better this fraction approximates the derivative.

In practice, accuracy improves for a while, but past some point smaller values of h result in worse approximations to the derivative. As h gets smaller, the approximation error gets smaller but the numerical error increases. This is because the subtraction f(x+h) - f(x) becomes problematic. If you take h small enough (after all, in theory, smaller is better) then f(x+h) will equal f(x) to machine precision. This means all derivatives will be computed as zero, no matter what the function, if you just take h small enough.

Here's an example computing the derivative of sin(x) at x = 1.

cout << std::setprecision(15);
for (int i = 1; i < 20; ++i)
{
    double h = pow(10.0, -i); // h = 10^-1, 10^-2, ..., 10^-19
    cout << (sin(1.0+h) - sin(1.0))/h << "\n";
}
cout << "True result: " << cos(1.0) << "\n";

Here is the output of the code above. To make the output easier to understand, digits after the first incorrect digit have been replaced with periods.

True result: 0.54030230586814

The accuracy improves as h gets smaller until h = 10^-8. Past that point, accuracy decays due to loss of precision in the subtraction. When h = 10^-16 or smaller, the output is exactly zero because sin(1.0+h) equals sin(1.0) to machine precision. (In fact, 1+h equals 1 to machine precision. More on that below.)

(The results above were computed with Visual C++ 2008. When compiled with gcc 4.2.3 on Linux, the results were the same except for the last four numbers. Where VC++ produced zeros, gcc produced negative numbers: -0.017..., -0.17..., -1.7..., and 17....)
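
One way to act on this trade-off is to pick h near the square root of machine epsilon rather than as small as possible. The sketch below assumes that rule of thumb; the name forwardDifference and the scaling by max(1, |x|) are choices made for this example, and the step is a heuristic rather than an optimum for every function.

```cpp
#include <algorithm>
#include <cfloat>
#include <cmath>

// Hypothetical helper: forward-difference estimate of f'(x).
// The step is scaled by sqrt(DBL_EPSILON), a common heuristic, not an optimum.
double forwardDifference(double (*f)(double), double x)
{
    double h = std::sqrt(DBL_EPSILON) * std::max(1.0, std::fabs(x));
    // x + h may not be exactly representable, so divide by the step actually taken.
    double xph = x + h;
    double step = xph - x;
    return (f(xph) - f(x)) / step;
}
```

For sin at x = 1 this picks h of roughly 1.5e-8, close to the h = 10^-8 where the experiment above was most accurate.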

What do you do when your problem requires subtraction and it's going to cause a loss of precision? Sometimes the loss of precision isn't a problem; doubles start out with a lot of precision to spare. When the precision is important, it's often possible to use some trick to change the problem so that it doesn't require subtraction, or doesn't require the same subtraction that you started out with.

See the CodeProject article Avoiding Overflow, Underflow, and Loss of Precision for an example of using algebraic trickery to change the quadratic formula into a form more suitable for retaining precision. See also comparing three methods of computing standard deviation for an example of how algebraically equivalent methods can perform very differently.
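
To give a flavor of that kind of rearrangement, here is a minimal sketch of one well-known way to avoid the subtraction in the quadratic formula. It is not taken from the referenced article. The name quadraticRoots is hypothetical, and the sketch assumes a is nonzero and the discriminant is non-negative.

```cpp
#include <cmath>
#include <utility>

// Hypothetical helper: real roots of a*x^2 + b*x + c = 0.
// Assumes a != 0 and a non-negative discriminant; no overflow handling.
std::pair<double, double> quadraticRoots(double a, double b, double c)
{
    double disc = std::sqrt(b * b - 4.0 * a * c);
    // b and copysign(disc, b) have the same sign, so this sum never cancels.
    double q = -0.5 * (b + std::copysign(disc, b));
    if (q == 0.0)
        return std::make_pair(0.0, 0.0);  // b == 0 and c == 0: double root at zero
    // The second root comes from the product of the roots, c/a, not a subtraction.
    return std::make_pair(q / a, c / q);
}
```

For x^2 - 10^8 x + 1 = 0, the textbook formula computes the small root as the difference of two nearly equal numbers near 10^8, while this version obtains it by division.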

Floating Point Numbers have Finite Ranges

Everyone knows that floating point numbers have finite ranges, but this limitation can show up in unexpected ways. For example, you may find the output of the following lines of code surprising.

cout << std::setprecision(15); // the default precision of 6 would hide the digits
float f = 16777216;
cout << f << " " << f+1 << "\n";

This code prints the value 16777216 twice. What happened? According to the IEEE 754 specification for floating point arithmetic, a float type is 32 bits wide. The significand (what used to be called the mantissa) carries 24 bits of precision: 23 stored bits plus an implicit leading bit. The remaining bits hold the exponent and the sign. The number 16777216 is 2^24 and so the float variable f has no precision left to represent f+1. A similar phenomenon would happen for 2^53 if f were of type double because a 64-bit double carries 53 bits of precision in the significand. The following code prints 0 rather than 1.

x = 9007199254740992; // 2^53
cout << ((x+1) - x) << "\n";

We can also run out of precision when adding small numbers to moderate-sized numbers. For example, the following code prints "Sorry!" because DBL_EPSILON (defined in float.h) is the gap between 1 and the next larger double, so adding only half of it to 1 rounds back to 1 under the default rounding mode.

x = 1.0;
y = x + 0.5*DBL_EPSILON;
if (x == y)
    cout << "Sorry!\n";

Similarly, the constant FLT_EPSILON is the gap between 1 and the next larger float, so adding a sufficiently small fraction of it to 1 leaves the result equal to 1 when using float types.
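
The spacing between adjacent floating point values can also be inspected directly. The following sketch uses std::nextafter from the C++11 standard library; the name gapAbove is hypothetical and chosen only for this example.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helpers: gap between x and the next larger representable value.
double gapAbove(double x)
{
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}

float gapAbove(float x)
{
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
```

At 1.0 the gap is DBL_EPSILON. At 2^53 for a double, and at 2^24 for a float, the gap is 2, which is why adding 1 had no effect in the examples above.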

Use Logarithms to Avoid Overflow and Underflow

The limitations of floating point numbers described in the previous section stem from having a limited number of bits in the significand. Overflow and underflow result from also having a finite number of bits in the exponent. Some numbers are just too large or too small to store in a floating point number.

Many problems appear to require computing a moderate-sized number as the ratio of two enormous numbers. The final result may be representable as a floating point number even though the intermediate results are not. In this case, logarithms provide a way out. If you want to compute M/N for large numbers M and N, compute log(M) - log(N) and apply exp() to the result. For example, probabilities often involve ratios of factorials, and factorials become astronomically large quickly. For N > 170, N! is larger than DBL_MAX, the largest number that can be represented by a double (without extended precision). But it is possible to evaluate expressions such as 200!/(190! 10!) without overflow as follows:

x = exp( logFactorial(200) 
       - logFactorial(190) 
       - logFactorial(10) );

A simple but inefficient logFactorial function could be written as follows:

double logFactorial(int n)
{
    // log(n!) = log(2) + log(3) + ... + log(n)
    double sum = 0.0;
    for (int i = 2; i <= n; ++i)
        sum += log((double)i);
    return sum;
}

A better approach would be to use a log gamma function if one is available. See How to calculate binomial probabilities for more information.
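
As a sketch of the log gamma approach, the version below assumes a C++11 or later standard library, which provides std::lgamma, and relies on the identity log(n!) = lgamma(n + 1). The binomial helper is a hypothetical name added for this example.

```cpp
#include <cmath>

// Sketch of logFactorial using log gamma: log(n!) = lgamma(n + 1).
// Assumes n >= 0 and a standard library that provides std::lgamma (C++11).
double logFactorial(int n)
{
    return std::lgamma(n + 1.0);
}

// Hypothetical helper: n! / (k! (n-k)!) kept on the log scale until the final exp().
// Assumes 0 <= k <= n and a result that fits in a double.
double binomial(int n, int k)
{
    return std::exp(logFactorial(n) - logFactorial(k) - logFactorial(n - k));
}
```

Here binomial(200, 10) evaluates the expression 200!/(190! 10!) from above. The result carries a small rounding error from exp(), so it should not be expected to be an exact integer.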

Numeric Operations don't Always Return Numbers

Because floating point numbers have their limitations, sometimes floating point operations return "infinity" as a way of saying "the result is bigger than I can handle." For example, the following code prints 1.#INF on Windows and inf on Linux.

x = DBL_MAX;
cout << 2*x << "\n";

Sometimes the barrier to returning a meaningful result has to do with logic rather than finite precision. Floating point data types represent real numbers (as opposed to complex numbers) and there is no real number whose square is -1. That means there is no meaningful number to return if code requests sqrt(-2), even in infinite precision. In this case, floating point operations return NaNs. These are floating point values that represent error codes rather than numbers. NaN values display as 1.#IND on Windows and nan on Linux.

Once a chain of operations encounters a NaN, everything is a NaN from there on out. For example, suppose you have some code that amounts to something like the following:

if (x - x == 0)
    // do something

What could possibly keep the code following the if statement from executing? If x is a NaN, then so is x - x and NaNs don't equal anything. In fact, NaNs don't even equal themselves. That means that the expression x == x can be used to test whether x is a (possibly infinite) number. For more information on infinities and NaNs, see IEEE floating point exceptions in C++.
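
If a C++11 or later standard library is available, the same checks can be written with std::isnan and std::isinf instead of the x == x idiom. The sketch below is only an illustration, and the function name describe is hypothetical.

```cpp
#include <cmath>
#include <string>

// Hypothetical helper: label a double as "nan", "inf", or "finite".
// Uses std::isnan and std::isinf from <cmath> (C++11 or later).
std::string describe(double x)
{
    if (std::isnan(x))
        return "nan";
    if (std::isinf(x))
        return "inf";
    return "finite";
}
```

Note that aggressive floating point optimization settings, such as gcc's -ffast-math, allow the compiler to assume NaNs never occur, which can defeat both these checks and the x == x test.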

For More Information

The article What Every Computer Scientist Should Know About Floating-Point Arithmetic explains floating point arithmetic in great detail. It may be what every computer scientist would know ideally, but very few will absorb everything presented there.


History

  • 24th September, 2008: Original post
  • 29th October, 2008: Added reference, modified code to also compile with gcc, reported VC++ vs gcc difference in one example


This article, along with any associated source code and files, is licensed under The BSD License


About the Author

John D. Cook
Singular Value Consulting
United States
I am an independent consultant in software development and applied mathematics. I help companies learn from their data to make better decisions.

Check out my blog or send me a note.


Comments and Discussions

Re: A few minor points
wtwhite, 3-Nov-08 14:49
Some useful points. However I would contest the following two:

Andrew Phillips wrote:
1. You say "don't test for equality". This is only true if the result of a calculation or intermediate values cannot be represented exactly. In fact there are many situations where comparing for equality/inequality is useful and even necessary.

The only case that I can think of where you know that an FP answer will always be represented exactly is if you happen to be working with fractions that are multiples of a small (positive or negative) power of 2, and you restrict yourself to the operations +, - and *. In which case you might be better off (in terms of runtime efficiency) using an integer-based fixed point representation.

Andrew Phillips wrote:
5. I have never seen an algorithm that required using logarithms to avoid overflow or underflow.

I work with programs that perform maximum likelihood inference of phylogenetic (evolutionary) trees from DNA data. These programs (and probably all types of ML inference programs) perform computations using log-likelihoods instead of raw likelihoods, in part because the tiny probabilities resulting from typical problems far exceed the range provided by IEEE double-precision.

Last Updated 26 Jun 2014
Article Copyright 2008 by John D. Cook
Everything else Copyright © CodeProject, 1999-2017