|
|||||||||||||||||||||
|
|||||||||||||||||||||
|
Announcements
Chapters
Services
Feature Zones
|
IntroductionThere are several traps that even very experienced programmers fall into when they write code that depends on floating point arithmetic. This article explains five things to keep in mind when working with floating point numbers, i.e. Don't Test for EqualityYou almost never want to write code like the following: double x;
double y;
...
if (x == y) {...}
Most floating point operations involve at least a tiny loss of precision and so even if two numbers are equal for all practical purposes, they may not be exactly equal down to the last bit, and so the equality test is likely to fail. For example, the following code snippet prints double x = 10;
double y = sqrt(x);
y *= y;
if (x == y)
cout << "Square root is exact\n";
else
cout << x-y << "\n";
In most cases, the equality test above should be written as something like the following: double tolerance = ...
if (fabs(x - y) < tolerance) {...}
Here Worry about Addition and Subtraction more than Multiplication and DivisionThe relative errors in multiplication and division are always small. Addition and subtraction, on the other hand, can result in complete loss of precision. Really the problem is subtraction; addition can only be a problem when the two numbers being added have opposite signs, so you can think of that as subtraction. Still, code might be written with a "+" that is really subtraction. Subtraction is a problem when the two numbers being subtracted are nearly equal. The more nearly equal the numbers, the greater the potential for loss of precision. Specifically, if two numbers agree to n bits, n bits of precision may be lost in the subtraction. This may be easiest to see in the extreme: If two numbers are not equal in theory but they are equal in their machine representation, their difference will be calculated as zero, 100% loss of precision. Here's an example where such loss of precision comes up often. The derivative of a function cout << std::setprecision(15);
for (int i = 1; i < 20; ++i)
{
double h = pow(10.0, -i);
cout << (sin(1.0+h) - sin(1.0))/h << "\n";
}
cout << "True result: " << cos(1.0) << "\n";
Here is the output of the code above. To make the output easier to understand, digits after the first incorrect digit have been replaced with periods. 0.4...........
0.53..........
0.53..........
0.5402........
0.5402........
0.540301......
0.5403022.....
0.540302302...
0.54030235....
0.5403022.....
0.540301......
0.54034.......
0.53..........
0.544.........
0.55..........
0
0
0
0
True result: 0.54030230586814
The accuracy improves as (The results above were computed with Visual C++ 2008. When compiled with gcc 4.2.3 on Linux, the results were the same except of the last four numbers. Where VC++ produced zeros, gcc produced negative numbers: -0.017..., -0.17..., -1.7..., and 17....) What do you do when your problem requires subtraction and it's going to cause a loss of precision? Sometimes the loss of precision isn't a problem; See the CodeProject article Avoiding Overflow, Underflow, and Loss of Precision for an example of using algebraic trickery to change the quadratic formula into form more suitable for retaining precision. See also comparing three methods of computing standard deviation for an example of how algebraically equivalent methods can perform very differently. Floating Point Numbers have Finite RangesEveryone knows that floating point numbers have finite ranges, but this limitation can show up in unexpected ways. For example, you may find the output of the following lines of code surprising. float f = 16777216;
cout << f << " " << f+1 << "\n";
This code prints the value x = 9007199254740992; // 2^53
cout << ((x+1) - x) << "\n";
We can also run out of precision when adding small numbers to moderate-sized numbers. For example, the following code prints " x = 1.0;
y = x + 0.5*DBL_EPSILON;
if (x == y)
cout << "Sorry!\n";
Similarly, the constant Use Logarithms to Avoid Overflow and UnderflowThe limitations of floating point numbers described in the previous section stem from having a limited number of bits in the significant. Overflow and underflow result from also having a finite number of bits in the exponent. Some numbers are just too large or too small to store in a floating point number. Many problems appear to require computing a moderate-sized number as the ratio of two enormous numbers. The final result may be representable as a floating point number even though the intermediate results are not. In this case, logarithms provide a way out. If you want to compute x = exp( logFactorial(200)
- logFactorial(190)
- logFactorial(10) );
A simple but inefficient double logFactorial(int n)
{
double sum = 0.0;
for (int i = 2; i <= n; ++i)
sum += log((double)i);
return sum;
}
A better approach would be to use a log gamma function if one is available. See How to calculate binomial probabilities for more information. Numeric Operations don't Always Return NumbersBecause floating point numbers have their limitations, sometimes floating point operations return "infinity" as a way of saying "the result is bigger than I can handle." For example, the following code prints x = DBL_MAX;
cout << 2*x << "\n";
Sometimes the barrier to returning a meaningful result has to do with logic rather than finite precision. Floating point data types represent real numbers (as opposed to complex numbers) and there is no real number whose square is Once a chain of operations encounters a if (x - x == 0)
// do something
What could possibly keep the code following the For More InformationThe article What Every Computer Scientist Should Know About Floating-Point Arithmetic explains floating point arithmetic in great detail. It may be what every computer scientist would know ideally, but very few will absorb everything presented there. History
| ||||||||||||||||||||