I'm working in C++, but the language doesn't really matter here.

My brain isn't bending the right way. I'm thinking through a fixed precision numeric library and I can't figure out if I can simply get away with a fixed shift to place the mantissa. I don't know if the math works.

C++
```
uint32_t val = 1;
val <<= 16; // turn into fixed point 16.16
uint32_t res = val >> 16; // turn back into an int
```

If I do that, can I perform div/mul add/sub on this number in fixed point form? What should I watch out for?

What I have tried:

I started studying other fixed point libraries but the ones I've found simply aren't very clear.
Posted
Updated 21-Dec-22 6:03am
Peter_in_2780 29-Nov-22 4:32am
Add/subtract is easy. Mul/div, you need to keep track of scaling. Lots of 16 bit shifts before/after operations. Otherwise it's just like regular integer arithmetic, with the same fun and games around overflow, divide by zero, etc.
honey the codewitch 29-Nov-22 5:00am
Thanks! I've kind of answered my own question since I asked it - funny how that seems to work so often, but now I'm interested in what others have to say. =)
CPallini 29-Nov-22 7:09am
I see a connection:
https://www.codeproject.com/Messages/5909708/Re-Gosh-I-messed-up-equals

## Solution 1

Well, you need to encapsulate your value inside a class, so then you can provide your own operator overloads to work with it. I'd suggest even making it a template, so you can choose the underlying representation size and the number of bits to shift; e.g.

C++
```
template <typename Valtype, size_t ShiftCount>
class fixed {
    Valtype value;
    ⋮
};
```

Addition and subtraction between identical fixed types just work by adding/subtracting their underlying values. For mixed fixed types, e.g. if `a` has a shift count of 16 bits and `b` has a shift count of 8 bits, you have to align them before adding, and consider what the result type should be. I avoid the latter issue by not allowing mixed types in `operator+`, but I do allow them in `operator+=`, since there it is explicit that the result needs to be the type of the left operand.
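A minimal sketch of that alignment, assuming unsigned raw values and C++17. The `fixed` struct here is a hypothetical stand-in with a public `value` member, not the actual class from my code:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical minimal fixed-point type, for illustration only.
template <typename ValType, std::size_t ShiftCount>
struct fixed {
    ValType value; // raw scaled integer: real value * 2^ShiftCount

    // Same-type addition: the scales already match, so add the raw values.
    friend constexpr fixed operator+(fixed a, fixed b) {
        return fixed{static_cast<ValType>(a.value + b.value)};
    }

    // Mixed-type +=: bring the right operand to this type's scale first.
    template <typename V2, std::size_t S2>
    constexpr fixed& operator+=(fixed<V2, S2> rhs) {
        if constexpr (S2 > ShiftCount) {
            // rhs has more fractional bits: drop the extras (loses precision).
            value += static_cast<ValType>(rhs.value >> (S2 - ShiftCount));
        } else {
            // rhs has the same or fewer fractional bits: scale it up.
            value += static_cast<ValType>(rhs.value) << (ShiftCount - S2);
        }
        return *this;
    }
};
```

With this, adding an 8-bit-shift 1.5 (raw `0x180`) to a 16-bit-shift 1.0 (raw `0x10000`) leaves raw `0x28000`, i.e. 2.5. Overflow of the integer part is not checked in this sketch.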

For multiplication, the result has a shift count that is the sum of the arguments' shift counts.
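The reason is that the raw values are `x * 2^A` and `y * 2^B`, so their product is `x * y * 2^(A+B)`. A rough sketch, assuming 32-bit unsigned operands and a 64-bit result so the raw product cannot overflow (the `fixed` struct and this `operator*` are illustrative, not taken from a real library):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical minimal fixed-point type, for illustration only.
template <typename ValType, std::size_t ShiftCount>
struct fixed {
    ValType value; // raw scaled integer: real value * 2^ShiftCount
};

// The product of an A-bit-shift and a B-bit-shift value carries A + B
// fractional bits. Assumption: widening to 64 bits is acceptable here.
template <std::size_t A, std::size_t B>
constexpr fixed<std::uint64_t, A + B>
operator*(fixed<std::uint32_t, A> x, fixed<std::uint32_t, B> y) {
    return {static_cast<std::uint64_t>(x.value) * y.value};
}
```

For example, 1.5 with a 16-bit shift (raw `0x18000`) times 2.0 with an 8-bit shift (raw `0x200`) gives raw `0x3000000` with a 24-bit shift, which is 3.0.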

Now division is the hard one. I don't have a general solution in my own code, but have arranged things to suit the specific needs of the code that uses it. When you divide, how many fractional bits do you want in the answer?
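One possible way to make that choice explicit (an assumption for illustration, not what my own code does) is to let the caller name the result's fractional bit count `R`. The raw quotient of an A-bit-shift value by a B-bit-shift value has `A - B` fractional bits, so the numerator is pre-shifted left by `R + B - A` before dividing. The `divide` helper below is hypothetical:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical minimal fixed-point type, for illustration only.
template <typename ValType, std::size_t ShiftCount>
struct fixed {
    ValType value; // raw scaled integer: real value * 2^ShiftCount
};

// Divide x by y, producing a result with R fractional bits.
// The caller must ensure y.value != 0; the quotient is truncated.
template <std::size_t R, std::size_t A, std::size_t B>
constexpr fixed<std::uint32_t, R>
divide(fixed<std::uint32_t, A> x, fixed<std::uint32_t, B> y) {
    static_assert(R + B >= A && R + B - A <= 32,
                  "this sketch only handles a left pre-shift of 0..32 bits");
    // Widen to 64 bits, then pre-shift so the quotient lands on scale 2^R.
    std::uint64_t num = static_cast<std::uint64_t>(x.value) << (R + B - A);
    return {static_cast<std::uint32_t>(num / y.value)};
}
```

Usage would look like `divide<16>(a, b)`, where `A` and `B` are deduced from the arguments. A quotient too large for 32 bits is silently truncated here.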

If you don't generalize to a template that allows different shift counts, then multiplication will necessarily chop off the extra bits, and division gives your (only) type as a result. But the operators themselves get more complex: each one has to widen the intermediate result and shift it back to the fixed scale.
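For the single-type case from the question (unsigned 16.16 in a `uint32_t`), that could look roughly like this. The `fix16_*` names are made up for this sketch, and signed values, rounding, and overflow of the integer part are deliberately left out:

```cpp
#include <cstdint>

using fix16 = std::uint32_t; // raw unsigned 16.16 value (assumption: no negatives)

constexpr fix16 fix16_from_int(std::uint32_t i) { return i << 16; }
constexpr std::uint32_t fix16_to_int(fix16 f) { return f >> 16; }

// The full product has 32 fractional bits; shift 16 of them away.
constexpr fix16 fix16_mul(fix16 a, fix16 b) {
    return static_cast<fix16>((static_cast<std::uint64_t>(a) * b) >> 16);
}

// Pre-shift the numerator so the quotient keeps 16 fractional bits.
// The caller must ensure b != 0.
constexpr fix16 fix16_div(fix16 a, fix16 b) {
    return static_cast<fix16>((static_cast<std::uint64_t>(a) << 16) / b);
}
```

The 64-bit intermediate is what keeps the extra bits alive long enough to shift them back; doing the same in plain 32-bit arithmetic would overflow for most inputs.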

If you do allow different shift counts, then you can automatically allow promotion but not conversions that lose precision. The latter can be available with an explicit conversion.
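A C++17 sketch of that split, using a pair of constrained converting constructors. Everything here (`int_bits`, `is_lossless`, `rescale`, `from_raw`, `raw`) is a hypothetical name, and "lossless" is assumed to mean that neither fractional nor integer bits are lost:

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <type_traits>

// Integer bits left over after S fractional bits (hypothetical helper).
template <typename V, std::size_t S>
constexpr std::size_t int_bits =
    static_cast<std::size_t>(std::numeric_limits<V>::digits) - S;

// A conversion is treated as lossless if no fractional or integer bits go away.
template <typename From, std::size_t FS, typename To, std::size_t TS>
constexpr bool is_lossless =
    (FS <= TS) && (int_bits<From, FS> <= int_bits<To, TS>);

// Move a raw value from scale 2^FS to scale 2^TS.
template <typename To, std::size_t TS, std::size_t FS, typename From>
constexpr To rescale(From raw) {
    if constexpr (FS > TS)
        return static_cast<To>(raw >> (FS - TS)); // drops fractional bits
    else
        return static_cast<To>(static_cast<To>(raw) << (TS - FS));
}

template <typename ValType, std::size_t ShiftCount>
class fixed {
    ValType value{};
public:
    constexpr fixed() = default;
    static constexpr fixed from_raw(ValType r) { fixed f; f.value = r; return f; }
    constexpr ValType raw() const { return value; }

    // Implicit promotion: allowed only when nothing can be lost.
    template <typename V2, std::size_t S2,
              std::enable_if_t<is_lossless<V2, S2, ValType, ShiftCount>, int> = 0>
    constexpr fixed(const fixed<V2, S2>& o)
        : value(rescale<ValType, ShiftCount, S2>(o.raw())) {}

    // Lossy conversion: available, but the caller has to spell it out.
    template <typename V2, std::size_t S2,
              std::enable_if_t<!is_lossless<V2, S2, ValType, ShiftCount>, int> = 0>
    constexpr explicit fixed(const fixed<V2, S2>& o)
        : value(rescale<ValType, ShiftCount, S2>(o.raw())) {}
};
```

So `fixed<std::uint32_t, 16> b = a;` compiles when `a` is a `fixed<std::uint16_t, 8>`, while going the other way needs `fixed<std::uint16_t, 8> c(b);`. In C++20 the two constructors could probably be folded into one with a conditional `explicit(...)`.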

Of course, it's also good to provide ways to view these values: have a `to_string` function, make it work with `ostream`, and make it work with the `fmt` library. But also consider making the underlying conversion code follow the more efficient model introduced in C++17: `std::to_chars` lets the caller supply a local buffer, which is practical since the maximum length is known. This makes implementing the `ostream` and `fmt` output functions more efficient.
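As a rough illustration of that model for an unsigned 16.16 raw value: the hypothetical `fix16_to_chars` below writes into a caller-supplied buffer and reuses `std::to_chars` for the integer part, and `fix16_to_string` is a thin wrapper over it. Both names are invented for this sketch. A 16-bit binary fraction always terminates within 16 decimal digits, so 22 characters is the known maximum:

```cpp
#include <charconv>
#include <cstdint>
#include <string>
#include <system_error>

// Hypothetical to_chars-style formatter for a raw unsigned 16.16 value.
// Writes exact decimal digits into [first, last); no null terminator.
inline std::to_chars_result fix16_to_chars(char* first, char* last,
                                           std::uint32_t raw) {
    std::to_chars_result r = std::to_chars(first, last, raw >> 16); // integer part
    if (r.ec != std::errc{}) return r;
    std::uint32_t frac = raw & 0xFFFFu;
    if (frac == 0) return r; // whole number: no decimal point
    if (r.ptr == last) return {last, std::errc::value_too_large};
    *r.ptr++ = '.';
    while (frac != 0) {
        if (r.ptr == last) return {last, std::errc::value_too_large};
        frac *= 10; // the next decimal digit moves into bits 16 and up
        *r.ptr++ = static_cast<char>('0' + (frac >> 16));
        frac &= 0xFFFFu;
    }
    return r;
}

inline std::string fix16_to_string(std::uint32_t raw) {
    char buf[22]; // 5 integer digits + '.' + 16 fraction digits
    std::to_chars_result r = fix16_to_chars(buf, buf + sizeof buf, raw);
    return std::string(buf, r.ptr);
}
```

An `ostream` inserter or a `fmt` formatter could call the same function with its own stack buffer instead of going through a temporary `std::string`.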