
Five Tips for Floating Point Programming

By John D. Cook, 30 Oct 2008
 

Introduction

There are several traps that even very experienced programmers fall into when they write code that depends on floating point arithmetic. This article explains five things to keep in mind when working with floating point numbers, i.e. float and double data types.

Don't Test for Equality

You almost never want to write code like the following:

double x;
double y;
...
if (x == y) {...}

Most floating point operations involve at least a tiny loss of precision, so two numbers that are equal for all practical purposes may not be exactly equal down to the last bit, and the equality test is likely to fail. For example, the following code snippet prints approximately -1.776e-015, which is one unit in the last place of 10. Although in theory squaring should undo a square root, the round-trip operation is slightly inaccurate.

double x = 10; 
double y = sqrt(x);
y *= y;
if (x == y)
    cout << "Square root is exact\n";
else
    cout << x-y << "\n";

In most cases, the equality test above should be written as something like the following:

double tolerance = ...
if (fabs(x - y) < tolerance) {...}

Here tolerance is some threshold that defines what is "close enough" for equality. This raises the question of how close is close enough. This cannot be answered in the abstract; you have to know something about your particular problem to know how close is close enough in your context.
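As a rough sketch of what such a test can look like, the function below combines an absolute tolerance (useful near zero) with a relative tolerance (useful for large values). The name approxEqual and both tolerance parameters are hypothetical; the right values still depend on your problem.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Returns true when x and y differ by no more than absTol, or by no more
// than relTol times the larger magnitude. Both tolerances are hypothetical
// parameters that the caller must choose for the problem at hand.
bool approxEqual(double x, double y, double absTol, double relTol)
{
    assert(absTol >= 0.0 && relTol >= 0.0);
    if (x == y)                          // exact match, including equal infinities
        return true;
    if (std::isinf(x) || std::isinf(y))  // an infinity is close to nothing else
        return false;
    double diff = std::fabs(x - y);
    if (diff <= absTol)                  // absolute test: useful near zero
        return true;
    double scale = std::max(std::fabs(x), std::fabs(y));
    return diff <= relTol * scale;       // relative test: scales with the operands
}
```

With this sketch, approxEqual(10.0, y, 0.0, 1e-12) accepts the square root round trip shown earlier, while a NaN argument always yields false because every comparison involving a NaN is false.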

Worry about Addition and Subtraction more than Multiplication and Division

Barring overflow and underflow, the relative error of a single multiplication or division is always small. Addition and subtraction, on the other hand, can result in complete loss of precision. Really the problem is subtraction; addition can only be a problem when the two numbers being added have opposite signs, so you can think of that as subtraction. Still, code might be written with a "+" that is really subtraction.

Subtraction is a problem when the two numbers being subtracted are nearly equal. The more nearly equal the numbers, the greater the potential for loss of precision. Specifically, if two numbers agree to n bits, n bits of precision may be lost in the subtraction. This may be easiest to see in the extreme: If two numbers are not equal in theory but they are equal in their machine representation, their difference will be calculated as zero, 100% loss of precision.

Here's an example where such loss of precision comes up often. The derivative of a function f at a point x is defined to be the limit of (f(x+h) - f(x))/h as h goes to zero. So a natural approach to computing the derivative of a function would be to evaluate (f(x+h) - f(x))/h for some small h. In theory, the smaller h is, the better this fraction approximates the derivative. In practice, accuracy improves for a while, but past some point smaller values of h result in worse approximations to the derivative. As h gets smaller, the approximation error gets smaller but the numerical error increases. This is because the subtraction f(x+h) - f(x) becomes problematic. If you take h small enough (after all, in theory, smaller is better) then f(x+h) will equal f(x) to machine precision. This means all derivatives will be computed as zero, no matter what the function, if you just take h small enough. Here's an example computing the derivative of sin(x) at x = 1.

cout << std::setprecision(15);
for (int i = 1; i < 20; ++i)
{
    double h = pow(10.0, -i);
    cout << (sin(1.0+h) - sin(1.0))/h << "\n";
}
cout << "True result: " << cos(1.0) << "\n";

Here is the output of the code above. To make the output easier to understand, digits after the first incorrect digit have been replaced with periods.

0.4...........
0.53..........
0.53..........
0.5402........
0.5402........
0.540301......
0.5403022.....
0.540302302...
0.54030235....
0.5403022.....
0.540301......
0.54034.......
0.53..........
0.544.........
0.55..........
0
0
0
0
True result: 0.54030230586814

The accuracy improves as h gets smaller until h = 10^-8. Past that point, accuracy decays due to loss of precision in the subtraction. When h = 10^-16 or smaller, the output is exactly zero because sin(1.0+h) equals sin(1.0) to machine precision. (In fact, 1+h equals 1 to machine precision. More on that below.)
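The sweet spot near h = 10^-8 is no accident: it is close to the square root of DBL_EPSILON, where approximation error and rounding error are roughly balanced. The sketch below uses that rule of thumb. It is a common heuristic rather than something taken from the example above, and the function names forwardDifference and centralDifference are hypothetical.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>

// Forward difference (f(x+h) - f(x))/h with h near sqrt(DBL_EPSILON),
// scaled by the size of x. The scaling rule is a heuristic assumption.
template <typename F>
double forwardDifference(F f, double x)
{
    double h = std::sqrt(DBL_EPSILON) * std::fmax(1.0, std::fabs(x));
    double xh = x + h;
    h = xh - x;                 // use the step that was actually taken
    assert(h > 0.0);
    return (f(xh) - f(x)) / h;
}

// Central difference (f(x+h) - f(x-h))/(2h). Its approximation error shrinks
// like h*h, so a larger step near the cube root of DBL_EPSILON is used.
template <typename F>
double centralDifference(F f, double x)
{
    double h = std::cbrt(DBL_EPSILON) * std::fmax(1.0, std::fabs(x));
    double xp = x + h;
    double xm = x - h;
    assert(xp > xm);
    return (f(xp) - f(xm)) / (xp - xm);
}
```

Pass the function as a lambda, for example forwardDifference([](double t) { return std::sin(t); }, 1.0). The central difference costs the same number of function evaluations and usually recovers a few more correct digits.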

(The results above were computed with Visual C++ 2008. When compiled with gcc 4.2.3 on Linux, the results were the same except for the last four numbers. Where VC++ produced zeros, gcc produced negative numbers: -0.017..., -0.17..., -1.7..., and 17....)

What do you do when your problem requires subtraction and it's going to cause a loss of precision? Sometimes the loss of precision isn't a problem; doubles start out with a lot of precision to spare. When the precision is important, it's often possible to use some trick to change the problem so that it doesn't require subtraction, or doesn't require the same subtraction that you started out with.
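As one illustration of such a trick (chosen for this sketch, not taken from the articles referenced here), consider computing 1 - cos(x) for small x. The direct formula subtracts two nearly equal numbers, while the identity 1 - cos(x) = 2 sin^2(x/2) avoids the subtraction entirely. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Direct formula: for small x, cos(x) is so close to 1 that the subtraction
// cancels most or all of the significant bits.
double oneMinusCosNaive(double x)
{
    return 1.0 - std::cos(x);
}

// Same quantity via the identity 1 - cos(x) = 2*sin(x/2)^2, which involves
// no subtraction of nearly equal numbers.
double oneMinusCosStable(double x)
{
    double s = std::sin(0.5 * x);
    return 2.0 * s * s;
}

// Relative error of approx with respect to a nonzero exact value
double relativeError(double approx, double exact)
{
    assert(exact != 0.0);
    return std::fabs(approx - exact) / std::fabs(exact);
}
```

For x = 1e-9 the true value is very nearly x*x/2 = 5e-19. The stable version reproduces it to almost full precision, while the naive version loses essentially every significant digit.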

See the CodeProject article Avoiding Overflow, Underflow, and Loss of Precision for an example of using algebraic trickery to change the quadratic formula into a form more suitable for retaining precision. See also comparing three methods of computing standard deviation for an example of how algebraically equivalent methods can perform very differently.

Floating Point Numbers have Finite Ranges

Everyone knows that floating point numbers have finite ranges, but this limitation can show up in unexpected ways. For example, you may find the output of the following lines of code surprising.

float f = 16777216; 
cout << f << " " << f+1 << "\n";

This code prints the value 16777216 twice (assuming the output precision is set high enough to show every digit). What happened? According to the IEEE specification for floating point arithmetic, a float type is 32 bits wide. Twenty-four bits of precision are available for the significand (what used to be called the mantissa): 23 stored bits plus an implied leading bit. The remaining bits hold the sign and the exponent. The number 16777216 is 2^24, and so the float variable f has no precision left to represent f+1. A similar phenomenon occurs at 2^53 for type double, because a 64-bit double has a 53-bit significand. The following code prints 0 rather than 1.

double x = 9007199254740992.0; // 2^53
cout << ((x+1) - x) << "\n";

We can also run out of precision when adding small numbers to moderate-sized numbers. For example, the following code prints "Sorry!" because DBL_EPSILON (defined in float.h) is the gap between 1 and the next larger double, so adding only half of it to 1 rounds back to 1.

double x = 1.0;
double y = x + 0.5*DBL_EPSILON;
if (x == y)
    cout << "Sorry!\n";

Similarly, the constant FLT_EPSILON is the gap between 1 and the next larger float.
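The gap between neighboring floating point numbers grows with their magnitude, and it can be measured directly with std::nextafter. The following is a minimal sketch assuming a C++11 or later compiler; spacingAbove is a hypothetical helper name.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>

// Gap between x and the next representable double above it
double spacingAbove(double x)
{
    assert(std::isfinite(x));
    return std::nextafter(x, HUGE_VAL) - x;
}

// Same measurement for float
float spacingAbove(float x)
{
    assert(std::isfinite(x));
    return std::nextafter(x, HUGE_VALF) - x;
}
```

At 1.0 the gap equals DBL_EPSILON (FLT_EPSILON for float). At 2^53 for double and at 2^24 for float the gap has grown to 2, which is why adding 1 at those magnitudes has no effect.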

Use Logarithms to Avoid Overflow and Underflow

The limitations of floating point numbers described in the previous section stem from having a limited number of bits in the significand. Overflow and underflow result from also having a finite number of bits in the exponent. Some numbers are just too large or too small to store in a floating point number.

Many problems appear to require computing a moderate-sized number as the ratio of two enormous numbers. The final result may be representable as a floating point number even though the intermediate results are not. In this case, logarithms provide a way out. If you want to compute M/N for large numbers M and N, compute log(M) - log(N) and apply exp() to the result. For example, probabilities often involve ratios of factorials, and factorials become astronomically large quickly. For N > 170, N! is larger than DBL_MAX, the largest number that can be represented by a double (without extended precision). But it is possible to evaluate expressions such as 200!/(190! 10!) without overflow as follows:

double x = exp( logFactorial(200)
              - logFactorial(190)
              - logFactorial(10) );

A simple but inefficient logFactorial function could be written as follows:

double logFactorial(int n)
{
    double sum = 0.0;
    for (int i = 2; i <= n; ++i)
        sum += log((double)i);
    return sum;
}

A better approach would be to use a log gamma function if one is available. See How to calculate binomial probabilities for more information.
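If your compiler supports C++11, std::lgamma from cmath is one such log gamma function. The sketch below relies on the fact that Gamma(n + 1) = n! for non-negative integers n; binomialCoefficient is a hypothetical helper added for illustration.

```cpp
#include <cassert>
#include <cmath>

// log(n!) computed as lgamma(n + 1), since Gamma(n + 1) = n! for integers n >= 0
double logFactorial(int n)
{
    assert(n >= 0);
    return std::lgamma(n + 1.0);
}

// Binomial coefficient n! / (k! (n-k)!) evaluated in log space, so the
// individual factorials never have to fit in a double
double binomialCoefficient(int n, int k)
{
    assert(0 <= k && k <= n);
    return std::exp(logFactorial(n) - logFactorial(k) - logFactorial(n - k));
}
```

Calling binomialCoefficient(200, 10) evaluates 200!/(190! 10!) without overflow. The result is a floating point approximation, not an exact integer.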

Numeric Operations don't Always Return Numbers

Because floating point numbers have their limitations, sometimes floating point operations return "infinity" as a way of saying "the result is bigger than I can handle." For example, the following code prints 1.#INF on Windows and inf on Linux.

double x = DBL_MAX;
cout << 2*x << "\n";

Sometimes the barrier to returning a meaningful result has to do with logic rather than finite precision. Floating point data types represent real numbers (as opposed to complex numbers) and there is no real number whose square is -1. That means there is no meaningful number to return if code requests sqrt(-2), even in infinite precision. In this case, floating point operations return NaNs. These are floating point values that represent error codes rather than numbers. NaN values display as 1.#IND on Windows and nan on Linux.

Once a chain of operations encounters a NaN, everything is a NaN from there on out. For example, suppose you have some code that amounts to something like the following:

if (x - x == 0)
    // do something

What could possibly keep the code following the if statement from executing? If x is a NaN, then so is x - x and NaNs don't equal anything. In fact, NaNs don't even equal themselves. That means that the expression x == x can be used to test whether x is a (possibly infinite) number. For more information on infinities and NaNs, see IEEE floating point exceptions in C++.
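The self-comparison test can be wrapped in small helpers, sketched below with hypothetical names. The sketch assumes IEEE semantics are in effect; aggressive optimization flags such as gcc's -ffast-math may remove the x == x comparison. C++11 also provides std::isnan and std::isfinite in cmath, and the last function checks the hand-written test against std::isnan.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// True when x is a number (finite or infinite), false when x is a NaN.
// Relies on the IEEE rule that a NaN compares unequal to everything,
// including itself.
bool isNumber(double x)
{
    return x == x;
}

// True when x is neither a NaN nor an infinity
bool isFiniteNumber(double x)
{
    return isNumber(x) && std::fabs(x) <= std::numeric_limits<double>::max();
}

// Sanity check: the hand-written test should agree with the standard library
void checkAgainstStandardLibrary(double x)
{
    assert(isNumber(x) == !std::isnan(x));
    assert(isFiniteNumber(x) == std::isfinite(x));
}
```

Checking inputs with a helper like isFiniteNumber at the boundary of a calculation makes it easier to find where a NaN or infinity first entered the chain of operations.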

For More Information

The article What Every Computer Scientist Should Know About Floating-Point Arithmetic explains floating point arithmetic in great detail. It may be what every computer scientist would know ideally, but very few will absorb everything presented there.

History

  • 24th September, 2008: Original post
  • 29th October, 2008: Added reference, modified code to also compile with gcc, reported VC++ vs gcc difference in one example

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

John D. Cook
United States
I am an independent consultant in software development and applied mathematics. I help companies learn from their data to make better decisions.
 
Check out my blog or send me a note.
 

 



Comments and Discussions

 
Questionavoid devision by 0 and NaNmemberpip0108 May '13 - 5:44 
what if I want to make any double/float which is 0 to be a very small number instead. otherwise we will end up with NaN or exception.
 
is this "good" : double verySmallNumber = 1.0D / 1.0e+14;
GeneralMy vote of 2memberJuan Falgueras Cano18 Apr '13 - 1:01 
This comments do not consider nor Ruby (that seems to have no many of the exposed problems) nor expose samples of alternative solutions
GeneralMy vote of 5memberMC197218 Oct '12 - 0:11 
Thank you !
GeneralMy vote of 5memberThatsAlok27 Dec '11 - 20:33 
Though i am late, put this article deserve 5
GeneralMy vote of 5memberNajeeb Shaikh24 Dec '11 - 1:34 
Excellent, and very relevant to what I'm currently doing. Thanks a lot for this article!
GeneralFloating point representationmemberdybs4 Oct '08 - 9:04 
I know floating point numbers are not always exact, but why is it that when I'm stepping through the Visual Studio debugger (using 2005), when I should see 125.10000000000000, I instead see either 125.10000000000002 or 125.99999999999998? Is this just the nature of the precision of floating point numbers?
 
Thanks,
 
Brandon
AnswerRe: Floating point representationmemberJohn D. Cook4 Oct '08 - 9:14 
The key is that 0.1 has an infinite binary representation. The base 10 representation is finite, so it's understandable to think it can be represented exactly in a computer, but it cannot. The computer uses binary, and so some precision is lost in representing 0.1 internally.
GeneralRe: Floating point representationmemberdybs4 Oct '08 - 10:49 
Good point, that makes good sense. Thanks!
 
Brandon
Generalthanksmemberepitalon1 Oct '08 - 22:39 
Thanks for the clarifications.
 
Though I know this already approximately, this article is very clear and helpfull.
 
Jean-Marie, Software developper, France
GeneralAlready found a MUCH speedier way...memberKochise1 Oct '08 - 6:41 
http://www.codeproject.com/KB/tips/FloatingPointEquality.aspx?msg=1994707#xx1994707xx
 
Nice articles BTW...
 
Kochise
 
In Code we trust !

GeneralTolerance Of Floating Point Is A Major HeadachememberJeffBilkey30 Sep '08 - 13:09 
Thanks for your article. I have a specialized CAD package for drawing up roofs and optimizing the roofing. I have always said the tolerance is my biggest single problem. For example it is theoretically simple to test if two lines are parallel - just compare the slope. However the tolerance is different if the lines are horizontal or if they are nearing the vertical. The most successful way to test for parallelism I have found is to test for the distance between the ends of one line projected onto the other. If the difference in the distance is < 2mm or whatever tolerance I set, I call it parallel.
 
Also the quadratic formula
(-b +-sqrt(b*b - 4ac))/2a
I have found to be a complete waste of time in CAD. As the
(b*b - 4ac)
approaches zero, any tolerance problem means that even using long double can lead to erroneous results. I wouldn't advise anyone to use this formula in computing unless they are in complete control of the input.
GeneralRe: Tolerance Of Floating Point Is A Major HeadachememberJohn D. Cook30 Sep '08 - 13:32 
You might be interested in the last chapter of the book "Beautiful Code." Brian Hayes looks in detail at the kind of geometry problems you mention.
GeneralRe: Tolerance Of Floating Point Is A Major HeadachememberJeffBilkey30 Sep '08 - 13:45 
Thanks, I was not aware of it. The overview looks interesting.
GeneralRe: Tolerance Of Floating Point Is A Major Headachemembersupercat930 Oct '08 - 10:14 
I have always said the tolerance is my biggest single problem. For example it is theoretically simple to test if two lines are parallel - just compare the slope.
 
What about using the cross product? Given vectors (x1,y1) and (x2,y2), one can determine whether magnitude of the angle is above a certain amount by checking whether
 
(x1*y2-x2*y1)^2 > ((x1*x1 + y1*y1) * (x2*x2 + y2*y2)) * sin(angle)^2. The sine need only be computed once, so the check requires nine multiplications and three additions. No divisions or square roots. Only for very small angles would the loss of precision on the subtraction be a problem.
GeneralRe: Tolerance Of Floating Point Is A Major HeadachememberJeffBilkey30 Oct '08 - 12:40 
I will have to test it. I have found that cross products work well with integer coordinates but fail on floating point due to previously stated reasons. I take your point though and will test as you appear to be doing less computation than what I am doing using Pythagoras. Thanks.
GeneralRe: Tolerance Of Floating Point Is A Major Headachemembersupercat930 Oct '08 - 13:08 
When the goal is to determine whether a formula evaluates to something larger or smaller than a certain threshold, it is very frequently useful to rewrite the equality to minimize computations. An extremely common example of that is checking the length of a vector by seeing whether (x^2+y^2) > l^2 (where l is the desired length), instead of computing sqrt(x^2 + y^2). The cross-product inequality I gave is conceptually similar, though a little trickier. Since the cross product yields the sine of the angle times the product of the two lengths, and since length^2 is easier to compute than length, squaring both sides of the inequality simplifies the math.
 
As for numerical precision, the only subtraction is of two numbers whose maximum magnitude will be proportional to the product of the length of the vectors, and whose difference will be proportional to that quantity times the sine of the angle. The only cases where significant precision will be lost are those where the angle is very small, and the loss of precision will still leave a very small angle.
GeneralRe: Tolerance Of Floating Point Is A Major HeadachememberJeffBilkey30 Oct '08 - 18:05 
supercat9 you sound like a pure mathematician, fyi I am an engineer, an applied mathematician (I think!). My problem: I have software that draws houses and buildings on which we place roofing so it can be optimized and estimated. If a user nominates a ridge line then the line in the same polygon parallel to it and in the opposite direction is an eave, so the software should automatically add a gutter to that eave line. If it is at an angle to the ridge it is a Rake Edge(or Barge or Verge depending on which country you live in) and therefore no gutter (usually). Now as buildings are generally drawn of rectangular shape around 50% of the vectors (assuming 180 degrees is a small angle) fall into your exception case of small angles. In fact the to quote you "the only case where significant precision is lost" is the norm. Ok the above is a simple problem where a wide tolerance is fine, however when doing this calculation, the program may do several others where a wide tolerance is inappropriate. Believe you end up going round in circles. Don't get me wrong, I appreciate your comments and will test them. I have another problem to test for a vector to vertical to another, I believe your solution will turn out better than mine. Thanks.
GeneralRe: Tolerance Of Floating Point Is A Major Headachemembersupercat931 Oct '08 - 5:04 
I consider myself more of an engineer than a mathematician, but I dabble a little in everything. One of my hobbies is 6502 programming. If your goal is to decide whether two lines are close enough to parallel that an eave will be horizontal, the approach I gave should be more than adequate. If you were trying to distinguish between lines that are 0.000000018 degrees from parallel and those that were only 0.000000014 degrees from parallel, then you might run into problems with floating point precision. On the other hand, I would expect that you would probably regard any line that was within 0.01 degrees as parallel (and probably accept a lot looser tolerances than that), so numerical precision should not be an issue.
GeneralRe: Tolerance Of Floating Point Is A Major HeadachememberJeffBilkey31 Oct '08 - 13:52 
I trust we are not boring the rest of world with this to-and-fro. What you say is correct in the pure mathematical world. However we are drawing using CAD engines that are designed for building construction. Building construction accuracy would be to the nearest 1/2 inch (or 10mm where I come from). Hence the CAD engine does not save coordinates to anything like double precision, there is no need. So when we draw the line for the first time we are in double precision, but when you save the drawing and open it again, you have a slightly different coordinate for the line based on the precision of the CAD engine. You can imagine it being a little embarrassing if on saving and opening a drawing, then doing a re-calculation, you get a different answer to what you had when you first drew it. Also, digressing from computer tolerance, and into the general problem of tolerance, if someone is doing a re-roof then they measure using a tape and are really lucky if they come back with a measurement within a couple of inches. Hence you have a whole new problem with tolerance. A tolerance of 5% is meaningless, you have to come back to some numerical tolerance of say 2 inches as acceptable, or in your case a tolerance of 5 degrees. Both of which are understandable and relate to the real world.
GeneralRe: Tolerance Of Floating Point Is A Major Headachemembersupercat931 Oct '08 - 16:32 
[quote]A tolerance of 5% is meaningless, you have to come back to some numerical tolerance of say 2 inches as acceptable, or in your case a tolerance of 5 degrees.[/quote]
 
Well, what's necessary is to define requirements so that, e.g., an angle over 5.010 degrees is guaranteed to be considered 'not parallel', an angle of less than 4.990 is considered parallel, and anything between may be arbitrarily considered parallel or not.
 
No matter what method you use, there will be some possible case where a couple lines are right on the threshold of being called parallel and their status will be changed by things which shouldn't affect it. If that is a problem, it may be solved by adding hysteresis (keep track of whether two lines are believed to be parallel; if they're believed to be parallel and they measure more than 5.005 degrees, consider them to no longer be parallel; if they weren't parallel but now they're within 4.995 degrees, consider them parallel). That sort of approach will avoid surprises if the status of line pairs is preserved when they are saved, copied, etc. If the status values are not saved, the hysteresis may cause confusion.
GeneralRe: Tolerance Of Floating Point Is A Major HeadachememberDamir Valiulin3 Nov '08 - 7:49 
Jeff,
 
I also work with CAD and battle tolerance issues on daily basis. Over the years I've learned to stick in tolerance checks even in places I can't imagine that would need it. Proper approach to this problem would be to define acceptable tolerance up front - in your example 1-2mm and stick to it. So, if your units of measurement are mm and I'm assuming maximum problem dimensions do not exceed 100m, then you have about 7 digits of precision that you need to maintain. Any two values that are different after 7th digit would be considered the same for practical reasons.
 
I'd also store the data in file with at least 2-3 additional precision digits. So, in your case again, you'd want to store values with precision of up to a 1/100 of a mm. Then your results will always be identical on save/load.
 
For finding parallel lines, I use cross product as supercat9 mentioned:
 
	ux = x2-x1;
	uy = y2-y1;
	vx = x4-x3;
	vy = y4-y3;
	
	cp = ux*vy - uy*vx; //cross product
	if (cp < some_tolerance) {
		//lines are parallel
	}
 
There are lots of published efficient algorithms for line intersections, overlaps, etc. If you are building a serious CAD package you should definitely research this. I'd give some references, but I'm too lazy to check right now. Let me know if you need it.
 
Regards,
Damir
GeneralRe: Tolerance Of Floating Point Is A Major HeadachememberJeffBilkey3 Nov '08 - 10:33 
Thanks Damir. It is good to hear someone else is in the same boat. However I have found one single tolerance not good enough. For example by mathematics I deduce a line is parallel, so the line becomes an eave because it is parallel to a ridge, but it just so happens the mathematics is right on the tolerance limit. Then another line is drawn beside the eave which means half the line changes to another attribute so I need to split the eave line into two lines where one portion remains the eave. Now in splitting the line it changes the coordinates just enough so that it is OVER the tolerance limit and the portion that was an eave, and still should be, is no longer an eave because it is no longer parallel. Grrr. I found this by the program crashing but of course it only happened sometimes. A really nasty bug to find. So I introduced a greater tolerance at the second level of calculation so the program remains consistent. Storing an attribute is not really suitable as supercat suggested because in another example changing the eave to something else IS correct. I am chasing my tail. So what I have is several levels of tolerance, for better or worse.
GeneralRe: Tolerance Of Floating Point Is A Major HeadachememberDamir Valiulin3 Nov '08 - 18:01 
As much as I tried to draw all this in my head, it's hard to understand it without a good picture. :)
 
You are correct about using different tolerance values for different comparisons. Disregard previous post about sticking to a single value for all operations. However, generally you would use the same tolerance value for same type of comparisons.
 
Also, if inserting a vertex on the line makes the two resultant segments non-parallel, then either you are doing something wrong with insertion algorithm or lines parallel algorithm. Try using cross-product check I suggested. Just a correction there: if (fabs(cp) < some_tolerance) because cp can be negative.
 
Regards,
Damir Valiulin
GeneralA few minor pointsmemberAndrew Phillips29 Sep '08 - 15:37 
I applaud your article for helping us to create better software. However, I disagree on your emphasis in some cases. I guess the problems that one is likely to encounter depends on the application domain - my experience is mainly with geometrical algorithms.
 

1. You say "don't test for equality". This is only true if the result of a calculation or intermediate values cannot be represented exactly. In fact there are many situations where comparing for equality/inequality is useful and even necessary.
 

2. You say that floating point operations are inaccurate. The real problem is that the exact values cannot be represented exactly by the hardware. In your example, the square root of 10 is an irrational number that cannot be stored exactly. But if you changed the code to obtain the square root of 9 instead then the code will work as was expected.
 
double x = 9; 
double y = sqrt(x);
y *= y;
 

3. You recommend replacing an equality test with an approximate equality test. I agree that an approximate equality test is useful sometimes (eg when values depend on user input). However, in many cases the problem is with the algorithm, which should be redesigned to avoid the test altogether.
 

4. You say not to worry about multiplication and division. However, loss of significance can occur just as readily for those, as for example when dividing two large numbers.
 
I have often seen code to find the slope of a line that uses atan(dy/dx); which has large loss of significance for near vertical lines. Of course, it is better to use atan2(dy, dx).
 

5. I have never seen an algorithm that required using logarithms to avoid overflow or underflow. The exponent for double gives an enormous range of values. I think this is a very unlikely problem which could be solved by redesign of the algorithm to avoid it.
 

6. There is no discussion of cumulative errors which is a common mistake when using iterative algorithms.
 

7. Possibly the most common problem for beginners is using floating point numbers merely for the convenience of output formatting (and literals in the code). The solution is to use integers for calculations then add decimal places when the results are displayed.
 

8. The discussion assumes that float/double use IEEE representations. The float and double types in C/C++ do not specify any specific implementation. I know that almost everyone reading this will be using compilers that use IEEE implementations but a caveat that when you say float you mean "IEEE 32-bit floating point" etc, at the top of the article might be nice.
 
Andrew Phillips
http://www.hexedit.com
andrew @ hexedit.com

GeneralRe: A few minor pointsmembershangomatic30 Sep '08 - 5:30 
Andrew Phillips wrote:
5. I have never seen an algorithm that required using logarithms to avoid overflow or underflow. The exponent for double gives an enormous range of values. I think this is a very unlikely problem which could be solved by redesign of the algorithm to avoid it.

 
just recently I needed to resort to logarithms to avoid overflow in the evaluation of the probability density function of the Beta distribution [1] with 'a' and 'b' parameters in the thousands. These values don't work in the exponents without overflowing, so instead of
pow(x,a-1.0) * pow((1.0-x),b-1.0) / beta(a,b)

I used
exp( (a-1.0)*log(x) + (b-1.0)*log(1.0-x) - betaln(a,b) )

 
Just an example.
[1] http://en.wikipedia.org/wiki/Beta_distribution
GeneralRe: A few minor pointsmemberwtwhite3 Nov '08 - 14:49 
Some useful points. However I would contest the following two:
 
Andrew Phillips wrote:
1. You say "don't test for equality". This is only true if the result of a calculation or intermediate values cannot be represented exactly. In fact there are many situations where comparing for equality/inequality is useful and even necessary.

 
The only case that I can think of where you know that an FP answer will always be represented exactly is if you happen to be working with fractions that are multiples of a small (positive or negative) power of 2, and you restrict yourself to the operations +, - and *. In which case you might be better off (in terms of runtime efficiency) using an integer-based fixed point representation.
 
Andrew Phillips wrote:
5. I have never seen an algorithm that required using logarithms to avoid overflow or underflow.

 
I work with programs that perform maximum likelihood inference of phylogenetic (evolutionary) trees from DNA data. These programs (and probably all types of ML inference programs) perform computations using log-likelihoods instead of raw likelihoods, in part because the tiny probabilities resulting from typical problems far exceed the range provided by IEEE double-precision.
 
WTJW
GeneralTake a look at these C++ floating point utilitiesmemberSimon Hughes29 Sep '08 - 10:52 
You may find these of interest.
 
http://www.codeproject.com/KB/cpp/floatutils.aspx
 
Regards,
Simon Hughes

GeneralLogarithms express the floating numbermembergbb2128 Sep '08 - 9:40 
I think it would increase the error of calculation.
 
I don't think the logarithm could decrease the bits needed to express a number.
GeneralRe: Logarithms express the floating numbermemberJohn D. Cook28 Sep '08 - 9:59 
You are right. Taking logarithms does not improve precision, but it can prevent underflow or overflow. If you're able to compute your result without taking logs and you don't overflow or underflow, you're better off. But if you do overflow or underflow, you've lost all precision.
GeneralRe: Logarithms express the floating numbermembergbb2128 Sep '08 - 10:04 
Got it, thanks :)
GeneralExcellent remindermemberPatLeCat28 Sep '08 - 0:02 
That was really useful and in some parts insightful. As always with C++ you never stop learning.
Thanks mate.
GeneralIsnt float deprecatedmemberKarstenK25 Sep '08 - 1:11 
I use double as successor for float because of its broader range. I dont know of floating point optimizations of processors?
 
And arent data types and size on 64bit OS different?
 
Greetings from Germany

AnswerRe: Isnt float deprecatedmemberJohn D. Cook25 Sep '08 - 6:20 
There's not much reason to use floats any more, but one reason would be to conserve memory. Not usually an issue, but an application I had some connection to was memory-bound and could solve larger problems by storing values as floats. I wrote about that application here http://www.codeproject.com/KB/recipes/TailKeeper.aspx
 
The IEEE 754 standard defines exactly what a float and double are, down to the bit, though an OS can offer extensions. So you can write portable code that makes low-level assumptions about the layouts of floats and doubles.
GeneralRe: Isnt float deprecatedmembergeoyar29 Sep '08 - 8:41 
One reason to use float is an app with Gdiplus. The Gdiplus functions use REAL that is alias for float (for now)
like RectF rectF(2.0f, 1.0f, 10.0f, 10.9f). Simple RectF rectF(2.0, 1.0, 10.0, 10.9)evokes warnings.
 
geoyar

GeneralRe: Isnt float deprecatedmemberharold aptroot29 Sep '08 - 8:39 
Floats and doubles are the same in long mode
Calculations using floats can be up to twice as fast as with doubles (if using SSE) - if using the FPU, it only really matters for fsqrt, fdiv and fidiv, their throughput depends on the precision bits of the FPU control word (which may not be used by the compiler)
On many systems, moving a double is not atomic whereas moving a float usually is (rarely a problem, but it can be)
 
Often you don't need doubles, just like often you don't need longs - I would only use doubles where they're necessary
In real-time graphics doubles are usually huge overkill and the double SSE performance is really needed so floats are used.
GeneralRe: Isnt float deprecatedmemberKarstenK30 Oct '08 - 21:24 
I dont compute much, so the double give me more accuracy.
(Im coding more MS-GUI and sometimes ownerdraw)
 
Greetings from Germany

GeneralRe: Isnt float deprecatedmemberharold aptroot31 Oct '08 - 3:06 
It wouldn't matter much in that case but..
Think about whether this accuracy is significant - depending on the calculation, it may not be. Although it doesn't really matter in this case, it's like using longs all the time (and why would you do that?)
GeneralRe: Isnt float deprecatedmemberKarstenK31 Oct '08 - 3:57 
if there some loops with divisions and after that some additions or multiplications, and the fractuals are adding. And I need accuracy of int.
 
So I compute with doubles and cast at last.
 
Greetings from Germany

GeneralRe: Isnt float deprecatedmemberharold aptroot31 Oct '08 - 4:09 
well, if you need 32 bits of accuracy you'd have to use a double yes..
GeneralRe: Isnt float deprecatedmemberDon Clugston29 Sep '08 - 22:57 
Floats take up half as much memory as doubles. For very large arrays, you'll be dominated by memory access time, so floats will be faster.
Otherwise, use doubles.
GeneralRe: Isnt float deprecatedmembersupercat930 Oct '08 - 10:16 
Microsoft uses float for many of their graphics-related routines in .net; rather annoying since with "strict on" (usually helpful) the compiler requires explicit casts when passing doubles to graphics routines.
GeneralGood article some useful reminders!memberMike Diack24 Sep '08 - 21:35 
I'd forgotten some of this (apart from the infamous "don't check for equality" issue).
 
Thanks very much,
 
Mike

Last Updated 30 Oct 2008
Article Copyright 2008 by John D. Cook
Everything else Copyright © CodeProject, 1999-2013