
Floating point utilities

By Simon Hughes, 17 Nov 2003
 

Introduction

This is a set of floating point utilities. 16 functions are provided:

  • FloatsEqual Tests floats for equality within a tolerance. When the operands of == and != are some form of floating type (float, double, or long double), testing for exact equality is suspect because of round-off error and the lack of exact representation of most fractions.
  • Round Rounds a number to a specified number of digits.
  • RoundDouble Similar to Round() above, but uses doubles instead of floats.
  • SigFig Rounds a number to a specified number of significant figures.
  • FloatToText Converts a floating point number to ASCII text (without trailing zeros)
  • CalcBase This function wraps the given number so that it remains within its base. Returns a number between 0 and base - 1. For example if the base given was 10 and the parameter was 10 it would wrap it so that the number is now a 0. If the number given were -1, then the result would be 9. This function can also be used everywhere where a number needs to be kept within a certain range, for example angles (0 and 360) and radians (0 to TWO_PI).
  • CalcBaseFloat Same as CalcBase() above, except using floats
  • Angle Make sure angle is between 0 and 359
  • LineLength Calculates the length of a line between the following two points
  • RoundValue Converts a floating point value to an integer, very fast
  • FloatToInt Converts a floating point value to an integer, very fast
  • FP_INV Approximates 1/n; this is about 2.12 times faster than using 1.0f / n
  • CheckRange Makes sure Value is within range
  • CheckMin Makes sure Value is >= Min
  • CheckMax Makes sure Value is <= Max
  • Divide Performs a safe division

The credit for Round() and RoundDouble() goes to Josef Wolfsteiner.

Modifications

  • Simon Hughes, 18th November 2003.
    • Updated SigFig() to check for 0.0 being passed in as the value, as log10f(0) returns negative infinity rather than a usable exponent
    • Added FloatsEqual() function
    • Added CalcBase() function
    • Added CalcBaseFloat() function
    • Added Angle() function
    • Added LineLength() function
    • Modified RoundValue() function so it is much faster
    • Added FloatToInt()
    • Added FP_INV for very fast 1/n calculations
    • Added CheckRange(), CheckMin(), CheckMax(), Divide() template functions

Header

// Testing floats for equality. When the operands of operators == and != are
// some form of floating type (float, double, or long double), testing for
// equality between the two quantities is suspect because of round-off
// error and the lack of perfect representation of fractions.
// The value here is the tolerance within which two float values are
// treated as equivalent. The implementation is:
//     if(fabs(a - b) > float_equality) ...
// See FloatsEqual(a, b) function
#define float_equality 1.0e-20f
bool FloatsEqual(const float &a, const float &b);

// Rounds a number to a specified number of digits.
// Number is the number you want to round.
// Num_digits specifies the number of digits to which you want to round number.
// If num_digits is greater than 0, then number is rounded to the
// specified number of decimal places.
// If num_digits is 0, then number is rounded to the nearest integer.
// Examples
//        ROUND(2.15, 1)        equals 2.2
//        ROUND(2.149, 1)        equals 2.1
//        ROUND(-1.475, 2)    equals -1.48
float Round(const float &number, const int num_digits);
double RoundDouble(double doValue, int nPrecision);

// Rounds X to SigFigs significant figures.
// Examples
//        SigFig(1.23456, 2)        equals 1.2
//        SigFig(1.23456e-10, 2)    equals 1.2e-10
//        SigFig(1.23456, 5)        equals 1.2346
//        SigFig(1.23456e-10, 5)    equals 1.2346e-10
//        SigFig(0.000123456, 2)    equals 0.00012
float SigFig(float X, int SigFigs);

// Converts a floating point number to ascii (without the appended 0's)
// Rounds the value if nNumberOfDecimalPlaces >= 0
CString FloatToText(float n, int nNumberOfDecimalPlaces = -1);

// This function wraps the given number so that it remains within its 
// base. Returns a number between 0 and base - 1.
// For example if the base given was 10 and the parameter was 10 it
// would wrap it so that the number is now a 0. If the number given
// were -1, then the result would be 9. This function can also be
// used everywhere where a number needs to be kept within a certain
// range, for example angles (0 and 360) and radians (0 to TWO_PI).
int CalcBase(const int base, int num);
// Same as CalcBase() above, except using floats
float CalcBaseFloat(const float base, float num);
// Make sure angle is between 0 and 359
int Angle(const int &angle);

// Calculates the length of a line between the following two points
float LineLength(const CPoint &point1, const CPoint &point2);

//lint -save -e*
// Converts a floating point value to an integer, very fast.
inline int RoundValue(float param)
{
    // Uses the FloatToInt functionality
    int a;
    int *int_pointer = &a;

    __asm  fld  param
    __asm  mov  edx,int_pointer
    __asm  FRNDINT
    __asm  fistp dword ptr [edx];

    return a;
}
//lint -restore

// At the assembly level the recommended workaround for the second 
// FIST bug is the same for the first; 
// inserting the FRNDINT instruction immediately preceding the 
// FIST instruction. 
// lint -e{715}
// Converts a floating point value to an integer, very fast.
inline void FloatToInt(int *int_pointer, const float &f) 
{
    __asm  fld  f
    __asm  mov  edx,int_pointer
    __asm  FRNDINT
    __asm  fistp dword ptr [edx];
}

// This is about 2.12 times faster than using 1.0f / n
// r = 1/p
#define FP_INV(r,p) \
{ \
    int _i = 2 * 0x3F800000 - *(int *)&(p); \
    (r) = *(float *)&_i; \
    (r) = (r) * (2.0f - (p) * (r)); \
}

// Makes sure Var is within range
template<class T>
void CheckRange(T &Var, const T &Min, const T &Max)
{
    if(Var < Min)
        Var = Min;
    else
        if(Var > Max)
            Var = Max;
}

// Makes sure Var is >= Min
template<class T>
void CheckMin(T &Var, const T &Min)
{
    if(Var < Min)
        Var = Min;
}

// Makes sure Var is <= Max
template<class T>
void CheckMax(T &Var, const T &Max)
{
    if(Var > Max)
        Var = Max;
}

// Performs a safe division. Checks that b is not zero before division.
template<class T>
inline T Divide(const T &a, const T &b)    
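The listing above stops at the signature of Divide(), so its body is not available here. As a minimal sketch only, assuming that a "safe" division means returning a zero value when b is zero (the original may well do something different), it could look like this:

```cpp
// Hypothetical completion of Divide(): the original body is not shown.
// Assumption: a zero divisor yields a value-initialized T (zero for
// arithmetic types) instead of performing the division.
template<class T>
inline T Divide(const T &a, const T &b)
{
    if(b == T())
        return T();
    return a / b;
}
```

With this sketch, Divide(10, 2) gives 5 and Divide(1, 0) gives 0 rather than faulting. Whether silently returning zero is acceptable depends on the application.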

Source code

// Rounds a number to a specified number of digits.
// Number is the number you want to round.
// Num_digits specifies the number of digits to which you want
// to round number.
// If num_digits is greater than 0, then number is rounded
// to the specified number of decimal places.
// If num_digits is 0, then number is rounded to the nearest integer.
// Examples
//        ROUND(2.15, 1)        equals 2.2
//        ROUND(2.149, 1)        equals 2.1
//        ROUND(-1.475, 2)    equals -1.48
float Round(const float &number, const int num_digits)
{
    float doComplete5i, doComplete5(number * powf(10.0f, (float) (num_digits + 1)));
    
    if(number < 0.0f)
        doComplete5 -= 5.0f;
    else
        doComplete5 += 5.0f;
    
    doComplete5 /= 10.0f;
    modff(doComplete5, &doComplete5i);
    
    return doComplete5i / powf(10.0f, (float) num_digits);
}

double RoundDouble(double doValue, int nPrecision)
{
    static const double doBase = 10.0;
    double doComplete5, doComplete5i;
    
    doComplete5 = doValue * pow(doBase, (double) (nPrecision + 1));
    
    if(doValue < 0.0)
        doComplete5 -= 5.0;
    else
        doComplete5 += 5.0;
    
    doComplete5 /= doBase;
    modf(doComplete5, &doComplete5i);
    
    return doComplete5i / pow(doBase, (double) nPrecision);
}

// Rounds X to SigFigs significant figures.
// Examples
//        SigFig(1.23456, 2)        equals 1.2
//        SigFig(1.23456e-10, 2)    equals 1.2e-10
//        SigFig(1.23456, 5)        equals 1.2346
//        SigFig(1.23456e-10, 5)    equals 1.2346e-10
//        SigFig(0.000123456, 2)    equals 0.00012
float SigFig(float X, int SigFigs)
{
    if(SigFigs < 1)
    {
        ASSERT(FALSE);
        return X;
    }

    // log10f(0) is not finite (negative infinity), so return early
    if(X == 0.0f)
        return X;
    
    int Sign;
    if(X < 0.0f)
        Sign = -1;
    else
        Sign = 1;

    X = fabsf(X);
    float Powers = powf(10.0f, floorf(log10f(X)) + 1.0f);

    return Sign * Round(X / Powers, SigFigs) * Powers;
}

// Converts a floating point number to ascii (without the appended 0's)
// Rounds the value if nNumberOfDecimalPlaces >= 0
CString FloatToText(float n, int nNumberOfDecimalPlaces)
{
    CString str;

    if(nNumberOfDecimalPlaces >= 0)
    {
        int decimal, sign;
        char *buffer = _fcvt((double)n, nNumberOfDecimalPlaces, &decimal, &sign);

        CString temp(buffer);
        
        // Sign for +ve or -ve
        if(sign != 0)
            str = "-";

        // Copy digits up to decimal point
        if(decimal <= 0)
        {
            str += "0.";
            for(; decimal < 0; decimal++)
                str += "0";
            str += temp;
        } else {
            str += temp.Left(decimal);
            str += ".";
            str += temp.Right(temp.GetLength() - decimal);
        }
    } else {
        str.Format("%-g", n);
    }

    // Remove trailing zeros. "123.45000" becomes "123.45"
    int nFind = str.Find(".");
    if(nFind >= 0)
    {
        int nFinde = str.Find("e");    // 1.0e-010 Don't strip the ending zero
        if(nFinde < 0)
        {
            while(str.GetLength() > 1 && str.Right(1) == "0")
                str = str.Left(str.GetLength() - 1);
        }
    }

    // Remove decimal point if nothing after it. "1234." becomes "1234"
    if(str.Right(1) == ".")
        str = str.Left(str.GetLength() - 1);
    
    return str;
}

// Testing floats for equality. When the operands of operators == and != are
// some form of floating type (float, double, or long double), testing for
// equality between the two quantities is suspect because of round-off
// error and the lack of perfect representation of fractions.
// Two float values are treated as equivalent when they differ by no more
// than float_equality.
bool FloatsEqual(const float &a, const float &b)
{
    return (fabs(a - b) <= float_equality);
}

// This function wraps the given number so that it remains within its 
// base. Returns a number between 0 and base - 1.
// For example if the base given was 10 and the parameter was 10 it
// would wrap it so that the number is now a 0. If the number given
// were -1, then the result would be 9. This function can also be
// used everywhere where a number needs to be kept within a certain
// range, for example angles (0 and 360) and radians (0 to TWO_PI).
int CalcBase(const int base, int num)
{
    if(num >= 0 && num < base)
        return num;    // No adjustment necessary

    num %= base;       // Remainder is zero or has the sign of num
    if(num < 0)
        num += base;   // Shift negative remainders into 0..base-1

    return num;
}

// Same as CalcBase() above, except using floats
float CalcBaseFloat(const float base, float num)
{
    if(num >= 0.0f && num < base)
        return num;    // No adjustment necessary

    num = fmodf(num, base);    // Remainder is zero or has the sign of num
    if(num < 0.0f)
        num += base;           // Shift negative remainders into 0..base
    return num;
}

// Make sure angle is between 0 and 359
int Angle(const int &angle)
{
    return CalcBase(360, angle);
}

// Calculates the length of a line between the following two points
float LineLength(const CPoint &point1, const CPoint &point2)
{
    const CPoint dist(point1 - point2);
    return sqrtf(float((dist.x * dist.x) + (dist.y * dist.y)));
}
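The wrapping described for CalcBase() can also be sketched without any MFC dependencies. The helper name WrapInt is hypothetical and the code assumes base > 0. One case worth checking in any implementation is a negative exact multiple, such as -10 with base 10, which should wrap to 0.

```cpp
// Illustrative helper (hypothetical name): wraps num into 0..base-1.
// Assumes base > 0.
int WrapInt(int base, int num)
{
    int r = num % base;      // remainder is zero or has the sign of num (C++11)
    if(r < 0)
        r += base;           // shift negative remainders into range
    return r;
}
```

For example, WrapInt(10, 10) is 0, WrapInt(10, -1) is 9, and WrapInt(360, -90) is 270.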

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Simon Hughes
Software Developer (Senior) www.ByBox.com
United Kingdom
Member
C++ and C# Developer for 21 years. Microsoft Certified.
 
UK Senior software developer / team leader.
 
I've been writing software since 1985. I pride myself on designing and creating software that is first class. That means it has to be fast, scalable, and with good use of design patterns.
 
I have done everything from risk analysis and explosion modelling, banking systems, to highly scalable multi-threaded arrival and departure screens in many leading airports, to state of the art wireless warehouse systems.


Comments and Discussions

Question: Ok, but special numbers? (member MNorzagaray, 25 Apr '13 - 8:00)
The functions look OK, but what behavior do you expect with special numbers (NaN, infinity)? This is usually beyond the scope of most programmers, I think.
 
Best regards, and congratulations on your very clean programming style.
 
Miguel
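The article does not say how its functions treat NaN or infinity. As a hedged illustration only: a tolerance test of the form fabs(a - b) <= tolerance is false whenever either operand is NaN, and it is also false for two equal infinities, because inf - inf is NaN. A wrapper that makes these cases explicit might look like the sketch below. The name FloatsEqualChecked and the tolerance parameter are assumptions, not part of the article's code.

```cpp
#include <cmath>

// Hypothetical wrapper: compares two floats within a caller-supplied
// tolerance, with explicit handling of NaN and infinities.
bool FloatsEqualChecked(float a, float b, float tolerance)
{
    if(std::isnan(a) || std::isnan(b))
        return false;               // NaN never compares equal, even to itself
    if(std::isinf(a) || std::isinf(b))
        return a == b;              // infinities are equal only with the same sign
    return std::fabs(a - b) <= tolerance;
}
```

Whether two infinities of the same sign should count as equal is a design choice; the sketch follows the behavior of operator ==.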
General: conversion from float to double (member Roman Tarasov, 1 Dec '09 - 2:48)
Nice work, but I'd like to ask you about conversion operations with floating point numbers in C++.
My problem is below:
 

#include <stdio.h>
#include <conio.h>

int main()
{
    float a;
    a = 1.35f;
    double b;
    b = 0.0;
    b = b + a;
    printf("\n%.15f\n", b);
    getch();
}
 
In theory the result should be 1.35,
but in practice we get something like 1.3500000238418579 :((
 
Could you give me some advice?
I use Visual Studio 2008 Team System SP1
General: Re: conversion from float to double (member Simon Hughes, 1 Dec '09 - 4:53)
Because of the inaccuracies of storing numbers in floats, you should use the SigFig() function above so that it removes the 238... bit at the end. If you're after better accuracy, use doubles everywhere.
However, for speed I'd always use float, but be aware that numbers can't always be stored exactly.
 
Some numbers (e.g., 1/3 and 0.1) cannot be represented exactly in binary floating-point no matter what the precision. Software packages that perform rational arithmetic represent numbers as fractions with integral numerator and denominator, and can therefore represent any rational number exactly. Such packages generally need to use "bignum" arithmetic for the individual integers.
 
Regards,
Simon Hughes
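To make the point concrete: widening a float to double does not add information, it only exposes the digits the float already held. A float carries roughly 6 to 7 significant decimal digits, so printing the widened value with that many digits hides the tail. The helper below is a hypothetical sketch (FormatSignificant is not part of the article) that uses the standard %g conversion.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats a value with a given number of significant
// digits using the standard %g conversion (which drops trailing zeros).
std::string FormatSignificant(double value, int digits)
{
    char buffer[64];
    std::snprintf(buffer, sizeof(buffer), "%.*g", digits, value);
    return std::string(buffer);
}
```

With float a = 1.35f and double b = a, FormatSignificant(b, 7) yields "1.35", while asking for 17 digits shows the stored value 1.3500000238418579.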

General: IMHO equality test, done this way, is rather arbitrary. (member CPallini, 31 Dec '07 - 10:49)
BTW what was the criterion that made you choose 1.0e-20 (because it is exponentially half-way through the float range?)?
 
The standard float equality test == is difficult for the newbie to grasp, but introducing an arbitrary constant may, IMHO, be misleading for them (I think the constant should be problem dependent, i.e. application dependent).
 
BTW Happy new year.
:)
 
If the Lord God Almighty had consulted me before embarking upon the Creation, I would have recommended something simpler.
-- Alfonso the Wise, 13th Century King of Castile.

[my articles]
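One common way to address this concern is to scale the tolerance with the magnitude of the operands rather than fix an absolute constant. For context, the spacing between adjacent floats near 1.0 is about 1.19e-7 (FLT_EPSILON), so an absolute threshold of 1.0e-20 behaves like == for values of ordinary size. The sketch below is an assumption-laden illustration: NearlyEqual and relTol are hypothetical names, and the right tolerance remains application dependent.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical alternative to a fixed absolute threshold: the allowed
// difference scales with the magnitude of the operands.
// relTol is chosen by the caller; it is not a value from the article.
bool NearlyEqual(float a, float b, float relTol)
{
    const float scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= relTol * scale;
}
```

A purely relative test is strict near zero, so applications that compare values close to zero often combine it with a small absolute floor.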


Question: Faster Round()? (member Yap Chun Wei, 22 Jun '06 - 20:27)
Will this simple Round function be faster?
 
double Round(double val, int dp)
{
     int modifier = 1;
     for (int i=0; i<dp; ++i) modifier *= 10;
     return (floor(val*modifier+0.5)/modifier);
}
Answer: Re: Faster Round()? [modified] (member Simon Hughes, 24 Jun '06 - 13:11)
Congratulations. It's accurate (for positive numbers only) and it's 20 times faster in both debug and release modes.
 
However, you need to fix it to do negative numbers correctly.
RoundNew(2.15, 1) // gives 2.2 (correct)
RoundNew(2.149, 1) // gives 2.1 (correct)
RoundNew(-1.475, 2) // gives -1.47 (incorrect, should be -1.48)
 
However with a little fix:
double RoundNew(double val, int dp)
{
    int modifier = 1;
    for (int i = 0; i < dp; ++i)
		modifier *= 10;
 
	if(val < 0.0)
		return (floor(val * modifier - 0.5) / modifier);
	return (floor(val * modifier + 0.5) / modifier);
} 
It all works now as it should.
 
20x faster, Excellent.
 
Regards,
Simon Hughes

General: Re: Faster Round()? [modified] (member Yap Chun Wei, 25 Jun '06 - 14:52)
Thanks for the fix. But I think it needs another small modification, otherwise some negative numbers may not work. We just need to use ceil instead of floor when the number is negative.
 
double RoundNew(double val, int dp)
{
      int modifier = 1;
      for (int i = 0; i < dp; ++i)
          modifier *= 10;
 
     if(val < 0.0)
          return (ceil(val * modifier - 0.5) / modifier);
     return (floor(val * modifier + 0.5) / modifier);
}

Answer: Re: Faster Round()? (member Simon Hughes, 26 Jun '06 - 8:42)
Yes, you're right. Thanks for the update.
 
Regards,
Simon Hughes

General: Re: Faster Round()? (member bkrahmer, 26 Feb '10 - 12:36)
According to my benchmarking, this is still 3x slower than this:
 
double Round(double dIn, int iPlaces)
{
    if (dIn < 0)
    	return -((long)(((-dIn*pow(10.0,iPlaces))+0.5))/pow(10.0,iPlaces));
    else
    	return  ((long)(((dIn*pow(10.0,iPlaces))+0.5))/pow(10.0,iPlaces));
}

General: Re: Faster Round()? (member kanbang, 4 Aug '10 - 16:05)
This is programming!
General: Re: Faster Round()? (member Hoornet93, 22 Oct '07 - 21:04)
I just love this!!!!
Tnx!
 
93/93
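The floor/ceil pair discussed in this thread rounds halfway cases away from zero, which is also what std::round() does. As a hedged sketch (RoundPlaces is a hypothetical name, and no speed claim is made for it), the same behavior can be written without a branch on the sign:

```cpp
#include <cmath>

// Hypothetical variant: std::round() rounds halfway cases away from zero,
// so no separate branch for negative values is needed.
// Assumes dp >= 0 and that val * 10^dp stays well within double range.
double RoundPlaces(double val, int dp)
{
    double modifier = 1.0;
    for(int i = 0; i < dp; ++i)
        modifier *= 10.0;
    return std::round(val * modifier) / modifier;
}
```

As with every version in this thread, the result is still a binary double, so a value such as 2.1 is only the nearest representable number to 2.1.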

General: need help (member ravirevolt, 10 Mar '06 - 18:15)
I need a function which works like this:
quant(X,Q) takes two inputs, X (a matrix, vector or scalar) and Q (the minimum value), and returns the values in X rounded to the nearest multiple of Q.
Please help me with how to do this. Thanks in advance.
 
Ravi M.R
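A possible starting point for such a function, offered as a sketch rather than a definitive answer: divide by Q, round to the nearest integer, and multiply back. The name Quant and the use of std::vector for the vector case are assumptions, and Q is assumed to be positive.

```cpp
#include <cmath>
#include <vector>

// Hypothetical scalar version: nearest multiple of q. Assumes q > 0.
double Quant(double x, double q)
{
    return std::round(x / q) * q;
}

// Hypothetical element-wise version for a vector of values.
std::vector<double> Quant(const std::vector<double> &xs, double q)
{
    std::vector<double> out;
    out.reserve(xs.size());
    for(double x : xs)
        out.push_back(Quant(x, q));
    return out;
}
```

For instance, Quant(7.0, 2.5) gives 7.5 and Quant(-1.3, 0.5) gives -1.5. A matrix stored as nested vectors can be handled by applying the vector version to each row.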
General: Double to Float Question (member T. Kulathu Sarma, 28 Nov '03 - 11:38)
Hi,
 
I am trying to convert a double precision number to a float and I am having this problem, please help me to resolve the same.
 
double fdblValue = 11574.24;
float fFloatValue = ( float ) fdblValue;
 
I am getting 11574.2 instead of 11574.24. What is the issue? Please let me know.
 
Regards,
 
Sarma
General: Re: Double to Float Question (member Nayan Choudhary, 1 Sep '04 - 22:25)
I tried this...
 
-------------------------
double fdblValue = 11574.24;
float fFloatValue = (float) fdblValue;
 
cout<< fdblValue << endl<< fFloatValue;
-------------------------
 
Saw your problem.
.............
 

Then I tried this...
 
--------------------------------------
double fdblValue = 11574.24;
float fFloatValue = (float) fdblValue;
 
cout.precision (8);
cout<< fdblValue << endl<< fFloatValue;
---------------------------------------
 
It worked!
 

But it has a flaw: try setting "cout.precision (15)" instead of "cout.precision (8)" and see.
 
Always smile! :)

And if I am not for myself,
Who will be for me?
And if I am not for others, what am I?
And if not now, when?

General: Visual C++ float to string conversion problem (member Ed Storey, 8 Dec '00 - 5:47)
Hello,
 
I am just beginning with Visual C++ building a simple dialog program to run some
calculations and I ran into a little bit of a wall. I will be inputting a couple floating point
values and then hit calculate and the program will perform some calculations and output
the data to another control box. My problem is:
 
1) I get the input no problem
2) I convert it into a floating point
3) I do the calculation
 
//the problem is here!!
4) once the data is calculated I want to output it.
However the problem lies in converting the new calculated float back to a string
in borland you have FloatToStrF and it does it no problem
However here in visual C++ I have not found a routine or function that does this to my
requirements.
My question is how do I take the float value and return it to the edit control box as a
string. I might just not be doing it correctly to begin with thus the problem or
confusion but here is a simple break down.
I rewrote this a little using distance, speed and time; I figured this would be better than
giving you my program with all the variables (it's for a robot's arm movement)

//on hitting calculate inputs distance and speed
// outputs the time it will take
void CExoSpinDlg::OnCalc()
{
// these two just to store the value of edit control boxes
CString someText;
CString someText2;

//edit control box 1 is distance
m_distance.GetWindowText(someText);

// edit control box 2 is speed, i am using a bunch of spin controls for my data
// as well so i will put one in here but it should not make a difference right?
m_SPINVALUE.GetWindowText(someText2);
 
// now we have the two values stored as CStrings
//from string to float (i am aware of the possible loss of exactness here
// but until i figure out Vc++ a little more i am stuck
float distance = atof(LPCSTR(someText));
float speed = atof(LPCSTR(someText2));
float time = distance/speed;
 
//MY PROBLEM IS HERE TAKING THIS FLOAT AND PUTTING BACK INTO
// THE DIALOG BOXES
//back to string is where i am struggling with
// i know the ftoa is NOT correct but I can not seem to find any other
// way of doing it
char buffer[256];
ftoa(time,buffer,10);
MessageBox(buffer);

}
 

any thoughts or suggestions on how i might do this would be
appreciated
 
Thank You
Ed Storey
General: Re: Visual C++ float to string conversion problem (member Codin' Carlos, 20 Jan '02 - 13:20)
Well, there's always good ol' sprintf(buffer,"%.2f", time);...
 
- Carlos
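Carlos's suggestion can be wrapped in a small helper. The sketch below is illustrative only: FloatToString is a hypothetical name, and it uses snprintf so that the buffer size is respected. In an MFC dialog, CString::Format with the same "%.2f" format followed by SetWindowText on the edit control would serve the same purpose.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats a float with a fixed number of decimal places.
std::string FloatToString(float value, int decimals)
{
    char buffer[64];
    std::snprintf(buffer, sizeof(buffer), "%.*f", decimals, value);
    return std::string(buffer);
}
```

For example, FloatToString(2.5f, 2) produces "2.50".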
General: Re: Visual C++ float to string conversion problem (member tomasusan, 7 Nov '06 - 8:41)
Thanks Carlos! It's the little things that matter.
General: Fast divides (member sigfpe, 22 Nov '00 - 10:01)
Suppose you have some code that calculates a bunch of rational functions (a rational function is a function constructed using +, -, * and /). Then you can replace the code with something that uses only one /. It's only useful in certain situations, depending on the relative speed of / and the other operators. For example, the following triangle rasteriser setup code 'na = 1.0/a; nb = 1.0/b; nc = 1.0/c;' can be replaced by 't = 1/(a*b*c); ct = c*t; na = b*ct; nb = a*ct; nc = a*b*t;'. 7 multiplies and one divide should be faster than 3 divides. Damn - I've given away my secret.
--
SIGFPE
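Written out as a function, the trick in the post looks roughly like the sketch below. The struct and function names are hypothetical. The code assumes that a, b and c are non-zero and that their product neither overflows nor underflows, and the results can differ from three separate divisions in the last bits because of the extra roundings.

```cpp
// Hypothetical container for the three reciprocals.
struct Reciprocals
{
    float na, nb, nc;
};

// Three reciprocals from a single division, as described in the post.
// Assumes a, b and c are non-zero and a*b*c stays within float range.
Reciprocals Reciprocals3(float a, float b, float c)
{
    const float t  = 1.0f / (a * b * c);   // the only division
    const float ct = c * t;
    Reciprocals r;
    r.na = b * ct;        // b*c / (a*b*c) = 1/a
    r.nb = a * ct;        // a*c / (a*b*c) = 1/b
    r.nc = a * b * t;     // a*b / (a*b*c) = 1/c
    return r;
}
```

Whether this is actually faster depends on the target hardware, as the post itself notes.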
General: Maintaining Significant Figures (suss Dave Aebi, 28 Sep '00 - 16:12)
The significant figures specified in the original SigFig routine get 'lost' when the value is converted to floating point. For example, if the result is 1.2 and we have specified 4 significant figures, we need to go to extra work to correctly display this as 1.200. The FloatToText routine converts it to 1.2; %f in the format will display something like 1.2000000. Yes, we can specify the precision modifier in the printf format code, but this requires that we compute the order of magnitude of the number so that we can determine how many decimal places we need to achieve the right number of significant figures. The following modified version of SigFig correctly produces the string version of the number. It is not pretty code, but it works!
 
CString SigFigStr(float X, int SigFigs)
{
CString str;
 
if(SigFigs < 1)
{
ASSERT(FALSE);
return str;
}
 
int Sign;
if(X < 0.0f)
Sign = -1;
else
Sign = 1;
 
X = fabsf(X);
float Powers = powf(10.0f, floorf(log10f(X)) + 1.0f);
 
float val = Sign * Round(X / Powers, SigFigs) * Powers;
 
str.Format("%f", val);
str.TrimLeft();
str.TrimRight();
 
int end = SigFigs;
if(Sign < 0)
end++;
if(str.Find('.') != -1)
end++;
 
str = str.Left(end);
 
// Remove decimal point if nothing after it. "1234." becomes "1234"
if(str.Right(1) == ".")
str = str.Left(str.GetLength() - 1);
 
return str;
}
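A shorter route to the same goal, offered as a hedged alternative and not as a replacement for the routine above: the '#' flag of the standard %g conversion keeps trailing zeros, so the requested number of significant figures survives formatting. SigFigText is a hypothetical name, sigFigs is assumed to be at least 1, and very large or very small values come out in exponent notation.

```cpp
#include <cstdio>
#include <string>

// Hypothetical alternative: the '#' flag of %g keeps trailing zeros, so the
// requested number of significant figures is preserved in the text.
std::string SigFigText(double x, int sigFigs)
{
    char buffer[64];
    std::snprintf(buffer, sizeof(buffer), "%#.*g", sigFigs, x);
    std::string text(buffer);
    // '#' also forces a decimal point; drop it when nothing follows it,
    // so that "1234." becomes "1234"
    if(!text.empty() && text.back() == '.')
        text.pop_back();
    return text;
}
```

For example, SigFigText(1.2, 4) produces "1.200".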
General: SERIOUS Performance improvements for sqrt (suss Simon Hughes, 27 Sep '00 - 5:18)
// I may post all this code as an update to the main topic.
 
// This is 3.4 times faster than using sqrtf(...)
#define FP_BITS(fp) (*(DWORD *)&(fp))
#define FP_ABS_BITS(fp) (FP_BITS(fp)&0x7FFFFFFF)
#define FP_SIGN_BIT(fp) (FP_BITS(fp)&0x80000000)
#define FP_ONE_BITS 0x3F800000
 

static unsigned int fast_sqrt_table[0x10000]; // declare table of square roots
 
typedef union FastSqrtUnion
{
float f;
unsigned int i;
} FastSqrtUnion;
 
void build_sqrt_table()
{
unsigned int i;
FastSqrtUnion s;

for (i = 0; i <= 0x7FFF; i++)
{

// Build a float with the bit pattern i as mantissa
// and an exponent of 0, stored as 127

s.i = (i << 8) | (0x7F << 23);
s.f = sqrtf(s.f);

// Take the square root then strip the first 7 bits of
// the mantissa into the table

fast_sqrt_table[i + 0x8000] = (s.i & 0x7FFFFF);

// Repeat the process, this time with an exponent of 1,
// stored as 128

s.i = (i << 8) | (0x80 << 23);
s.f = sqrtf(s.f);

fast_sqrt_table[i] = (s.i & 0x7FFFFF);
}
}
 

inline float fastsqrt(float n)
{
    if(FP_BITS(n) == 0)
        return 0.0f; // check for square root of 0

    FP_BITS(n) = fast_sqrt_table[(FP_BITS(n) >> 8) & 0xFFFF] | ((((FP_BITS(n) - FP_ONE_BITS) >> 1) + FP_ONE_BITS) & 0x7F800000);

    return n;
}
 
int main()
{
    build_sqrt_table();
    float a = fastsqrt(1.234f);
    return 0;
}
General: Re: SERIOUS Performance improvements for sqrt (suss Steven J. Ackerman, 29 Sep '00 - 11:00)
Another Square Root Algorithm:
 
/*******************************************************
** square_root - single precision square root
********************************************************
** input: value to take the square root of
** output: nothing
** calls: frexp(), ldexp()
** returns: 0.0 if input value <= 0.0,
** otherwise square root of input value
********************************************************
*/
float square_root(float xx)
{
float f, x, y;
int e;
 
f = xx;
if (f <= 0.0)
{
return 0.0;
}
 
/* split mantissa and exponent */
x = frexp(f, &e); /* f = x * 2**e, 0.5 <= x < 1.0 */
 
/* Q - is power of 2 odd ? */
if (e & 1)
{
/* yes - double mantissa and decrement the power of 2 (exponent) */
x = x + x;
e -= 1;
}
 
/* compute exponent power of 2 of the square root */
e >>= 1;
 
/* Q - is the mantissa between sqrt(2) and 2 ? */
if (x > 1.41421356237)
{
/* yes - offset mantissa, compute series */
x = x - 2.0;
y =
((((( -9.8843065718E-4 * x
+ 7.9479950957E-4) * x
- 3.5890535377E-3) * x
+ 1.1028809744E-2) * x
- 4.4195203560E-2) * x
+ 3.5355338194E-1) * x
+ 1.41421356237E0;
}
 
/* no - Q - is the mantissa between sqrt(2)/2 and sqrt(2) ? */
else if (x > 0.707106781187)
{
/* yes - offset mantissa, compute series */
x = x - 1.0;
y =
((((( 1.35199291026E-2 * x
- 2.26657767832E-2) * x
+ 2.78720776889E-2) * x
- 3.89582788321E-2) * x
+ 6.24811144548E-2) * x
- 1.25001503933E-1) * x * x
+ 0.5 * x
+ 1.0;
}
else
{
/* no - mantissa is between 0.5 and sqrt(2)/2 */
x = x - 0.5;
y =
((((( -3.9495006054E-1 * x
+ 5.1743034569E-1) * x
- 4.3214437330E-1) * x
+ 3.5310730460E-1) * x
- 3.5354581892E-1) * x
+ 7.0710676017E-1) * x
+ 7.07106781187E-1;
}
 
/* calculate y = y * 2**e */
y = ldexp(y, e);
 
return y;
}

General: Re: SERIOUS Performance improvements for sqrt (suss Simon Hughes, 2 Oct '00 - 3:12)
Your square_root() function is accurate, but is slower than sqrtf() itself (about 3.5 times slower).
General: Re: SERIOUS Performance improvements for sqrt (suss Steven J. Ackerman, 2 Oct '00 - 7:18)
Thanks for your response.
 
I'm using the square_root() function in an embedded x86 system written with the MSVC v1.52c compiler. The runtime didn't have a single precision sqrt() so this algorithm was faster for me than the runtime double precision version.
 
I like the table driven approach, and will investigate placing the table into ROM.
 
Steven J. Ackerman, Consultant
ACS, Sarasota, FL
http://www.acscontrol.com
sja@gte.net

General: Re: SERIOUS Performance improvements for sqrt (member DQNOK, 18 Apr '07 - 12:20)
How accurate is this? About how many bits off is it from the real answer?
 
I also wonder in real applications how much performance penalty one pays for having to load a 32k integer table (128k bytes?) into cache before this function can be called. Perhaps in very tight loops it's worth it.
 
Thanks for the bit-hacks. I'm always interested.
General: SERIOUS Performance improvements for 1/n (suss Simon Hughes, 27 Sep '00 - 5:14)
// This is about 2.12 times faster than using 1.0f / n
// r = 1/p
#define FP_INV(r,p) \
{ \
    int _i = 2 * 0x3F800000 - *(int *)&(p); \
    r = *(float *)&_i; \
    r = r * (2.0f - (p) * r); \
}
General: Re: SERIOUS Performance improvements for 1/n (member emilio_g, 18 Nov '03 - 23:02)
Simon Hughes wrote:
int _i = 2 * 0x3F800000 - *(int *)&(p); \
 
uhhh ?? what kind of magic is this ?

General: Re: SERIOUS Performance improvements for 1/n (member Simon Hughes, 18 Nov '03 - 23:59)
It's called speed voodoo :)
It's not as accurate, but it's damn fast.
 
Here are the results from a simple test:
float d = 22.0f, a = 1.0f / d, b;
FP_INV(b, d);
// a = 0.0454545
// b = 0.0448303

d = 5000.0f;
a = 1.0f / d;
FP_INV(b, d);
// a = 0.0002
// b = 0.000198521

 
Regards,
Simon Hughes
E-mail: simon@hicrest.net
Web: www.hicrest.net

General: Re: SERIOUS Performance improvements for 1/n (member JasonDoucette, 18 Oct '04 - 10:21)
Simon, great job. You may wish to add a note to the source code comments that states it is an approximation function, and not necessarily a replacement for all uses of 1.0 / x.
 
Cheers,
 
Jason Doucette
http://www.jasondoucette.com/
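To illustrate why FP_INV is only an approximation: subtracting the float's bit pattern from 2 * 0x3F800000 roughly negates its exponent, which gives a crude estimate of 1/p, and the final multiply is one Newton-Raphson step that refines the estimate. The results quoted earlier in this thread are off by about 1.4% (for 22) and 0.7% (for 5000). The function below is a hypothetical sketch of the same idea, assuming a 32-bit IEEE 754 float and a positive, normal input. It uses std::memcpy in place of the pointer casts to avoid aliasing problems.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical function form of the FP_INV idea. Assumes a 32-bit IEEE 754
// float and a positive, normal input value.
inline float ApproxInverse(float p)
{
    std::int32_t bits;
    std::memcpy(&bits, &p, sizeof(bits));
    bits = 2 * 0x3F800000 - bits;           // crude estimate: negate the exponent
    float r;
    std::memcpy(&r, &bits, sizeof(r));
    return r * (2.0f - p * r);              // one Newton-Raphson refinement step
}
```

Exact powers of two come out exact with this scheme, while other inputs carry a relative error of up to roughly 1.6% after the single refinement step.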
General: SERIOUS Performance improvements for RoundValue (suss Simon Hughes, 27 Sep '00 - 5:12)
// This is 15.5 times faster than RoundValue
__forceinline void FloatToInt(int *int_pointer, float f)
{
    __asm fld f
    __asm mov edx,int_pointer
    __asm FRNDINT
    __asm fistp dword ptr [edx];
}
General: Re: SERIOUS Performance improvements for RoundValue (member Ilia Kirsanau, 2 Jun '03 - 21:39)

Simon Hughes wrote:
// This is 15.5 times faster than RoundValue
 
Hmm, I didn't really get how to use it for round value,
could you give an example?
 
Ilia
General: Re: SERIOUS Performance improvements for RoundValue (member Simon Hughes, 19 Nov '03 - 0:10)
inline int RoundValue(float param)
{
	int a;
	int *int_pointer = &a;
 
	__asm  fld  param
	__asm  mov  edx,int_pointer
	__asm  FRNDINT
	__asm  fistp dword ptr [edx];
 
	return a;
}
 
int nVal = RoundValue(123.456f);
// or
float fVal = 123.456f;
int nVal = RoundValue(fVal);
 
inline void FloatToInt(int *int_pointer, const float &f) 
{
	__asm  fld  f
	__asm  mov  edx,int_pointer
	__asm  FRNDINT
	__asm  fistp dword ptr [edx];
}
 
int nVal;
float fVal = 123.456f;
FloatToInt(&nVal, fVal);

 
Regards,
Simon Hughes
E-mail: simon@hicrest.net
Web: www.hicrest.net
GeneralRe: SERIOUS Performance improvements for RoundValuememberbob1697231 Jan '08 - 5:14 
Simon Hughes wrote:
__forceinline void FloatToInt(int *int_pointer, float f)
{
    __asm fld f
    __asm mov edx,int_pointer
    __asm FRNDINT
    __asm fistp dword ptr [edx];
}

 
Would it be safer to explicitly set the control word before the round and restoring the previous control word once we're done? Or is this redundant?
 
It seems that if I wanted to "Round to nearest" (or "Round to even" since that is apparently the intel implementation based on the results for half-integers) I should not assume the control word had not been modified by other code run previously.
 
It would seem the code is still vulnerable to rounding errors if some prior code had set the control word to something other than the default 037F - Same as FINIT (round to nearest, all exceptions masked, 64-bit precision). This statement is based on the content in MSDN KnowledgeBase article Q126455 - specifically the line that says "Application programmers can avoid rounding errors in the second bug by not overriding the default rounding modes."
 
Since we are not sure if some prior code overrode the default, we need to explicitly set it to be sure.
 
By the way, thanks for the article. It has been a big help.
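The idea raised here, setting the rounding mode explicitly and restoring it afterwards, can be sketched in portable C++ with the <cfenv> functions instead of x87 assembly. This is an illustration under assumptions, not the article's method: RoundToNearest is a hypothetical name, and some compilers need a floating-point-environment option or pragma before they fully honor rounding mode changes.

```cpp
#include <cfenv>
#include <cmath>

// Hypothetical portable counterpart: forces round-to-nearest for the
// conversion, then restores whatever rounding mode the caller had set.
inline long RoundToNearest(float f)
{
    const int saved = std::fegetround();
    std::fesetround(FE_TONEAREST);
    const long result = std::lrint(f);   // rounds using the current rounding mode
    std::fesetround(saved);
    return result;
}
```

In round-to-nearest mode halfway cases go to the even integer, so 2.5f becomes 2 and 3.5f becomes 4, which matches the half-integer behavior described in the post above.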
General: Round to Nearest X function (suss Martin MacRobert, 26 Sep '00 - 12:38)
Does anyone know how to round to an arbitrary number?
For example, rounding to the nearest 2.5, 0.5 or 1.33
Nearest 2.5 case: 2.2 rounds down to 2, 2.3 rounds up to 2.5, 2.7 rounds down to 2.5, 2.8 rounds up to 3
Thanks.
General: Re: Round to Nearest X function (member Gene, 17 Jul '01 - 10:22)
Did anybody experience problems with the modf function? In most cases it works fine, but in some instances it does not work properly. For example, for 4570.0000 it would return 0.9999999 as the fraction and 4569.0000 as the integer part. Any ideas?
 
Thanks
Gene
General: Re: Round to Nearest X function (member mier, 26 Nov '03 - 2:11)
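The following function is a sample of how to round to the nearest .33 or .66 (it needs <math.h> for floor).
 
// Rounds the fractional part to .00, .33, .67 or up to the next whole number.
double GetNearest33( const double dRawNumber )
{
    double dBaseVal = floor(dRawNumber);
    double dBaseDiff = dRawNumber -  dBaseVal;   // fractional part, in [0, 1)
    // Work in hundredths and snap to a multiple of 33; the integer division is intentional.
    double dToNearest33 = (double)((int)((dBaseDiff + 0.165)*100.0)/33*33)/100.0;
    if (dToNearest33 == 0.99) dToNearest33 = 1.0;   // 0.99 stands for the next whole number
    if (dToNearest33 == 0.66) dToNearest33 = 0.67;  // report two thirds as .67
    return dBaseVal + dToNearest33;
}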

GeneralRe: Round to Nearest X functionmemberzPilott27 Oct '04 - 9:51 
This rounds to the nearest x.
The idea is to convert the number to fixed point, using the base that you want to round to, then convert back to floating point.
    float GetNearest( float number, float fixedBase ) {
      if (fixedBase != 0 && number != 0) {
        float sign = number>0?1:-1;
        number*=sign;
        number/=fixedBase;
        int fixedPoint = (int)floor(number+.5f);
        number = fixedPoint*fixedBase;
        number*=sign;
      }
      return number;
    }
 

GeneralRounding errata.sussDerekDaz22 Sep '00 - 10:21 
Lately, I've been trying to come up with a good rounding function that will even handle double SNAFUs. If you want a good example of one, simply set a double var to 105, multiply it by 1.15, and then by 2. Like below:
//dddDDDdddDDDdddDDD
double dTmp = 105.0;
 
dTmp *= 1.15; // equals 120.75....or does it?!!!
dTmp *= 2; // equals 241.5....or does it?!!!
 
//dddDDDdddDDDdddDDD
 
Now, when you try to round dTmp with a rounding routine, it will round down to 241 instead of correctly to 242. I'm sure a lot of you know that this is caused by a limitation of representing fractions in a binary format (the *1.15 is the cause; I'm pretty sure that 0.15 cannot be represented exactly).
 
My question is this: Has anyone seen a rounding alg that can handle one of these "off by .000000000000000001" doubles? I'm currently working on one right now, but my head is starting to get sore from banging it against the wall. I'm experimenting with binary arithmetic on this, but I'm not really sure if that is the way to go. Also, I think that IEEE 754-1985 may hold the answers, but I don't really want to spend $54 on it...
 
Dere
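One common workaround for the case above is to nudge the value by a few units in the last place before rounding, so that a result sitting a hair under .5 still rounds up. The sketch below is an assumption-laden illustration, not a general answer: `RoundHalfUpNudged` is a hypothetical name and the factor of 4 is an arbitrary choice rather than a derived error bound.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: round half up, but first nudge the value by a few
// units in the last place so results sitting just under .5 still round up.
// The factor of 4 is an arbitrary assumption, not a derived bound.
double RoundHalfUpNudged(double value)
{
    const double nudge =
        4.0 * std::numeric_limits<double>::epsilon() * std::fabs(value);
    return std::floor(value + nudge + 0.5);
}
```

The price is that values genuinely just below a half, within the nudge, also round up, so the tolerance should reflect how much error the preceding calculations can really accumulate.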
GeneralRe: Rounding errata.sussArlynn Smith25 Sep '00 - 3:49 
On Windows I don't know of any way to do it with the current compilers. It used to be that under Unix you had quad precision as an option. The only bad thing about that was that all your numbers were quad, but none of the intrinsic functions (sin, cos, sqrt) were quad, so you had to write your own.
GeneralRe: Rounding errata.memberreman18 Nov '03 - 17:42 
Actually, when it comes to rounding, it is common practice, and the IEEE 754 default, to round a value that ends in exactly .5 to the nearest even number, so half of the ties go down and half go up. This is so that there is not a statistical anomaly when rounding.
 
eg. 1.5 rounds to 2, 2.5 rounds to 2.
 
eg.
(1.5 + 2.5) / 2 = 2
rounded properly
(2 + 2) / 2 = 2
always .5 round up
(2 + 3) / 2 = 2.5
 
Of course depending on the number of odd and even numbers you get closer to the real average, but you definitely don't get further away!
GeneralRe: Rounding errata.memberMosc12 Jul '07 - 6:58 
RoundValue and FloatToInt will not work properly if you expect halves to round up, because in the default rounding mode the frndint instruction rounds equidistant cases to the nearest even integer (e.g. -0.5 gives 0 and 2.5 gives 2).
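The two tie-breaking rules discussed in this thread can be compared directly in standard C++. This is a small sketch with hypothetical function names; it relies on `std::nearbyint` following the current rounding mode, which is round-to-nearest-even unless other code has changed it.

```cpp
#include <cmath>

// Round half up: ties always go towards +infinity.
double RoundHalfUp(double value)
{
    return std::floor(value + 0.5);
}

// Round half to even: std::nearbyint follows the current rounding mode,
// which is round-to-nearest-even unless some code has changed it.
double RoundHalfEven(double value)
{
    return std::nearbyint(value);
}
```

With these, 1.5 and 2.5 both become 2 under `RoundHalfEven`, keeping their average at 2, while `RoundHalfUp` gives 2 and 3, which shifts the average to 2.5.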
GeneralComparing floating point values for equalitysussJohn Simmons / outlaw programmer20 Sep '00 - 3:32 
We all know that floating point numbers that look equal aren't necessarily equal once they have been through some arithmetic. Given the following code:
 
double x = 0.1 + 0.2;
double y = 0.3;
 
we cannot assume that this line will return TRUE:
 
if (x == y)...
 
I use this function for comparing raw doubles:
 
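//--------------------------------------------------------------------------/
// Requires <math.h>; BOOL comes from the Windows headers.
BOOL AlmostEqual(double n1, double n2, int decplaces)
{
if (decplaces == 0)
{
return (floor(n1) == floor(n2));
}
double divider = 10.0;
for (int i = 1; i <= decplaces; i++) // no semicolon here, or the body below runs only once
{
divider *= 10;
}
// divider is now 10^(decplaces + 1); compare symmetrically so the order of n1 and n2 does not matter
return (fabs(n1 - n2) < (1 / divider));
}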
 

Simply pass in the two values you're comparing, and how many decimal places to compare, and you're all set.

GeneralRe: Comparing floating point values for equalitysussTerence Russell24 Sep '00 - 21:20 
I do something similar, but only take the difference between x and y and compare that to some minimal value (below which I consider to be zero).
 
eg:
 
double nThreshold = 0.000001;
// Where nThreshold is smallest number I consider valid
 
bool IsEqual(double x, double y, double nThreshold)
{
double nDifference = fabs(x - y); // fabs from <math.h>; plain abs() can truncate to int
return ( nDifference < nThreshold );
}

GeneralRe: Comparing floating point values for equalitymemberThomas Haase25 Nov '03 - 21:15 
I agree, FloatsEqual will definitely not work appropriately.
 
I have this solution, based directly on the definition of DBL_EPSILON. As I understand the definition, DBL_EPSILON is valid only for a window around 1.0.
 
/*static*/ bool Double::IsEqual(double a, double b, double epsilon /* = DBL_EPSILON */)
{                 
   const double& e = epsilon;
   if (b == 0.0) {
      // b has all bits zero
      if (a > 0.0   || a < 0.0)
         // a has significant bits, thus a and b are not equal
         return false;
     
      // a has no bit set, thus a and b are equal
      return true;
   }
  
   const double q = a/b;
   if (q<0) {
      //q negative, thus a and b have different signs, thus a and b are not equal
      return false;
   }
 
   //q is positive, thus a and b must have the same sign
   //q must be 1.0 if a and b are equal
 
   //check, whether q is in epsilon window around 1.0
   if (q > 1.+e) return false;
   if (q < 1.-e) return false;
 
   return true;
}

 
Thomas Haase
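The absolute-threshold and relative-window approaches from the replies above can be combined without a division. The following is a minimal sketch, assuming C++11; `NearlyEqual` is a hypothetical name and both default tolerances are illustrative guesses that should be tuned to the accuracy of the calculations being compared.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical comparison combining an absolute and a relative tolerance.
// Both default tolerances are illustrative assumptions; pick values that
// match the accuracy of your own calculations.
bool NearlyEqual(double a, double b,
                 double relTol = 8.0 * std::numeric_limits<double>::epsilon(),
                 double absTol = 1e-12)
{
    const double diff = std::fabs(a - b);
    if (diff <= absTol)              // handles values at or very near zero
        return true;
    const double largest = std::max(std::fabs(a), std::fabs(b));
    return diff <= relTol * largest; // scales the window with the magnitude
}
```

The absolute test covers comparisons against zero, where a purely relative window collapses, and the relative test keeps the window meaningful for large magnitudes.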
GeneralI use this for roundingsussJohn Simmons / outlaw programmer20 Sep '00 - 3:25 
//-----------------------------------------------------------------------------/
// Requires <math.h>. Prec is added after scaling, so 0.5 rounds half up.
double TruncAt(double X, int Offset, double Prec)
{
return (floor(X * pow(10.0, Offset) + Prec) * pow(10.0, -Offset));
}
 
//-----------------------------------------------------------------------------/
double rounder(double n, int places)
{
if (places > 0)
{
// TruncAt scales n by 10^places first, so the midpoint in scaled units is always 0.5
return TruncAt(n, places, 0.5);
}
double fmodresult;
fmodresult = n - floor(n);
if (fmodresult >= 0.500000)
return (n + (1.0 - fmodresult));
else
return (n - fmodresult);
}

GeneralRe: I use this for roundingmemberHemme_one26 Nov '03 - 0:05 
inline double round(double in){
return floor(0.5 + in);
}

QuestionA Better FloatToText() function?sussJohn Simmons / outlaw pogrammer20 Sep '00 - 3:13 
CString FloatToText(double dVal, int nDecPlaces)
{
CString sNum;
sNum.Format("%.*lf", nDecPlaces, dVal);
return sNum;
}
 

AnswerRe: A Better FloatToText() function?sussJim Wuerch20 Sep '00 - 8:34 
Well, that does the float to string convert, but doesn't do what I'd use this particular function for, namely, stripping trailing 0's.
 
I'd add a while loop after the sNum.Format to start at the end of the string, and while the current char is a 0, replace it with a null until you hit the decimal, or the beginning of the string.
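The loop described above might look like the following. This is a sketch using `std::string` and `std::snprintf` rather than MFC's `CString`, and `FloatToTrimmedText` is a hypothetical name chosen for illustration.

```cpp
#include <cstdio>
#include <string>

// Hypothetical std::string counterpart of the approach described above:
// format with a fixed number of decimals, then drop trailing zeros.
// Very large values are truncated to fit the 64-byte buffer.
std::string FloatToTrimmedText(double value, int decimals)
{
    char buffer[64];
    std::snprintf(buffer, sizeof(buffer), "%.*f", decimals, value);
    std::string text(buffer);

    // Only trim when there is a decimal point, so "100" stays "100".
    if (text.find('.') != std::string::npos)
    {
        while (!text.empty() && text.back() == '0')
            text.pop_back();
        if (!text.empty() && text.back() == '.')
            text.pop_back();               // drop a dangling decimal point
    }
    return text;
}
```

Checking for the decimal point first matters: without it, a value formatted with zero decimals such as "100" would lose its significant zeros.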
GeneralRe: A Better FloatToText() function?sussJohn Simmons / outlaw programmer21 Sep '00 - 0:26 
Have you actually tried it?
 
The asterisk in the format string tells the Format function that the number of decimal places to include is passed as an extra argument.
 
Hence, calling FloatToText(123.450000, 2) would result in "123.45". How is that any different than what you're trying to do?

GeneralRe: A Better FloatToText() function?sussJohn Simmons / outlaw programmer21 Sep '00 - 1:01 
I was thinking about your statement after posting my reply, and I thought I'd add this:
 
If you want a function to strip trailing 0's from a double, you should write a function called "StripTrailingZeros". This function has (according to the prototype) a single purpose, converting a double to a string with a specified precision. It is considered by many to be poor programming technique to make a function do something that is not apparent from its prototype or parameter list.
 
My version of this function does *exactly* what the prototype defines, maintains the positive/negative status of the double, as well as omitting the '.' from the string if the programmer specified 0 decimal places.
 
Here's a StripTrailingZeros function for you:
 
CString StripTrailingZeros(double dValue, int nPrecision)
{
// first we convert it to a string
CString sValue = FloatToText(dValue, nPrecision);
 
// nothing to strip if there is no decimal separator (e.g. "100")
if (sValue.FindOneOf(".,") < 0)
{
return sValue;
}
 
// then we reverse it to make finding 0's easier
sValue.MakeReverse();
 
// find first NON zero character
int nPos = sValue.FindOneOf(".,123456789");
// if we found one of those chars somewhere other than the 1st char of string
if (nPos > 0)
{
// strip zeros out (Mid returns a new string, so assign it back)
sValue = sValue.Mid(nPos);
}
// if 1st char is now a dot or comma
if (sValue.GetAt(0) == '.' || sValue.GetAt(0) == ',')
{
// strip it out
sValue = sValue.Mid(1);
}
 
// reverse the string again
sValue.MakeReverse();
 
// return the massaged string
return sValue;
}
 
I haven't tested this function, but it should work fine. Just thought I'd throw that out there...
GeneralRe: A Better FloatToText() function?sussJim Wuerch24 Sep '00 - 18:31 
Actually, I do have something separate I use when stripping zeros: I just created a macro that will strip zeros from the string after conversion. That way, I can do it inline (it's a pretty short piece of code), and not pay for all sorts of function calls and such.
 
I'm using the stripping on a bunch of values that just get dumped to a listview-like display. Most of the time, the values will have a bunch of trailing zeros that make the display look busier than I'd like. Also, I'm just using it for columns where I expect the values to be rather integer-ish.

Web04 | 2.6.130516.1 | Last Updated 18 Nov 2003
Article Copyright 1999 by Simon Hughes
Everything else Copyright © CodeProject, 1999-2013