
Minimalist Floating Point Type


The work presented here consists of a small, portable floating point data type and library, together referred to as SFP ("Small Floating Point"). The library is designed for implementation on even the simplest and cheapest of hardware platforms. In particular, devices previously dominated by application-specific fixed-point mathematics are targeted by this work. Furthermore, this work attempts to support devices for which high-level language tools (with their built-in types and operations) do not exist, or are not practical.

The library is presented here in two forms. First, a C version is provided. This is designed to be compiled for target platforms. If necessary, this compilation can be done by hand. Second, this article includes a version written in assembly language, targeting Microchip Technology's "PIC" 8-bit microcontrollers. In particular, the PIC 16F690 and 16F1827 are targeted, and the last two major generations of 8-bit PIC microcontrollers are thus supported by the resultant code base, with only trivial changes.

Portions of the library have been ported to the Freescale Semiconductor 6800 family. These are not presented here, but are a possible topic for a future article.

Here are some highlights of the SFP system presented here:

  • Super-wide dynamic range (10^39)
  • 2.4 digit precision
  • Fully shift-based multiply and divide
  • Table-based logarithm and exponentiation functions
  • Correct rounding of results to nearest SFP approximation
  • Exhaustively exercised in distributed testing
  • Symmetrical design reduces bit arithmetic
  • 15-byte peak memory usage (plus code)
  • Flat function call model (zero internal calls)
  • Optional reentrant implementation
  • Three lightweight, non-conformant subsets
  • Modular design; each operation is free-standing


Floating point data types are a compact format for numerical data spanning a wide dynamic range. If one must store small numbers and large numbers together in the same type, it is cheaper with respect to storage to do so using a floating point type than using most other forms of representation. This is true because the internal representation of a floating point number includes an exponent. Just as it is quicker for a person doing calculations by hand to write a number in scientific notation (e.g., 1.1E+15, compared to its equivalent 1,100,000,000,000,000), it takes less space on a computer to represent very large (or very small) numbers in floating point. In both cases, it is the presence of the exponent that enables the briefer - and more efficient - representation. Truly general fixed point types end up being huge, e.g., the 128-bit C# decimal.

Data size notwithstanding, there are some real obstacles to implementing floating point operations on the very same cheap hardware as could otherwise benefit from its compactness. The smallest standardized floating point type until very recently was the IEEE-754 binary floating point "single", a 32-bit type [1]. Non-IEEE efforts tended to adopt a 32-bit minimum size as well. Hardware types invariably used a size of at least 32 bits, too (e.g., [2]).

Beyond their innate size (4 machine words on an 8-bit integer device), such types typically have a 24-plus-bit significand (the fractional portion of the number stored alongside the exponent, sometimes called the "mantissa") that thus spans multiple machine words on 8-bit hardware. Floating point routines end up having to do what amounts to 24-bit fixed point arithmetic - along with the bit level operations necessary to extract the significand.

Smaller floating point data types than the IEEE "single" were, until recently, not standardized, and mostly intended for classroom instruction; see, for example, this survey of formats. By 2008, though, 16-bit floating point data types were beginning to gain more prominence. IEEE standardized a 16-bit "half precision" binary floating point type in 2008, albeit in a form [3] that left the multi-byte significand, and the difficulties associated with it, in place.

This new IEEE 16-bit type also does not offer the range to serve as the basis of a general-purpose mathematics library; the largest representable number is ~65,000, similar to that of an unsigned 16-bit integer. Nevertheless, the decision made by IEEE to essentially shrink their floating point standard is an atypical and interesting one, in that it seems to defy the trend suggested by Moore's Law. This new IEEE type is not an academic curiosity; it has become an important part of 3D graphics, e.g., of the architectures of DirectX and OpenGL, and of compatible GPUs.

One implication of these developments is that computer scientists as a whole may have underestimated the potential usefulness of the sub-32-bit floating point type. The rest of this submission consists of an attempt to develop a general-purpose floating point data type that is even more amenable to implementation on 8-bit hardware than the IEEE "half precision" type.


The SFP Representation

The representation developed uses an 8-bit exponent in conjunction with a 7-bit significand and a sign bit. The byte in which the latter two values are stored is termed the "significand byte" in the text below. The exponent is a two's complement number (for easy addition and subtraction) and it occupies the other, "exponent" byte of the SFP format.

The significand byte holds the significand proper in bits 0-6, with the sign bit in bit 7. The significand proper's portion of the significand byte is an unsigned 7-bit number which ends up getting applied in conjunction with an implicit denominator of 128 and an implicit 8th bit of 1. The number with significand byte M and exponent byte E therefore evaluates as:

    ((M + 128) / 128) * 2^E

when the sign bit is clear, and:

    -((M' + 128) / 128) * 2^E

otherwise, where M' is equal to M modulo 128. These two formulas can be unified into one as shown below:

    (-1)^s * ((M' + 128) / 128) * 2^E

where s is the value of the sign bit.
The smallest representable quantity is therefore:

    ±((0 + 128) / 128) * 2^-128

This evaluates to plus or minus 2^-128. The largest representable quantity is:

    ±((127 + 128) / 128) * 2^127

This value is approximately (just less than) plus or minus 2^128.

Necessarily, tradeoffs have been made which will render the SFP type unsuitable for many purposes. The 8-bit value of the significand (including its implicit top bit) yields log10(2^8) = log10(256) ≈ 2.4 significant digits, which is comparatively poor.

However, the test applications developed for the new data type ("SFP") also demonstrate some potential usefulness in real world applications. A PID position controller was developed, and exhibited good stability and tuning characteristics compared to an earlier, fixed-point version of the same.

Additionally, programs were written to calculate very large factorials on some cheap 8-bit microcontrollers using SFP. These are devices that run on less than 2 volts, cost well under $2.00 per chip, and have around 256 bytes of RAM (not including storage for code). The level of accuracy attained in these factorial calculations was considered encouraging by the author (e.g., an error of just 0.040722% for the calculation of 34!).

Moreover, in all cases, the use of floating point resulted in a more abstract, natural, and universally applicable programming idiom. Fixed point representations implemented on an 8-bit integer device necessarily "require the programmer to keep track of implicit scale factors" [4], which will end up being application-specific. Such details are abstracted away by the use of floating point, allowing the programmer to freely refer to, say, 0.001 and 123456789.0 using the same type and the same operations.

A comparison with the IEEE 16-bit and 32-bit floating point data types is made in the table below:

Type     Size     Error*      Max        Precision
IEEE**   16-bit   4.88e-04    6.8e+05    3.3 digits
IEEE**   32-bit   5.96e-08    4.6e+38    7.2 digits
SFP      16-bit   8.84e-02    3.3e+39    2.4 digits
  • * A measure of peak rounding error
  • ** In "Round Nearest" mode; rounding for logarithms, etc., not specified for IEEE types

The range of the SFP type is superior to that of both IEEE types. SFP can thus represent a wider array of numbers than these types. In particular, the deficiency of the IEEE half precision type with respect to even moderately large or small numbers has been addressed. The reduction in precision, especially compared to the 32-bit type, is an obvious tradeoff.

The IEEE half precision format, while deviated from, was selected as a starting point for the SFP design decisions. Targeted changes were made to the representation used by IEEE, and these are enumerated below this paragraph. Most of these relate to the role anticipated for SFP, i.e., selected, well-understood applications running on low-cost / low-power 8-bit computing devices like the PIC and 6800 families:

  • The exponent is expressed in standard two's complement vs. exponent bias (or "excess") format. This allows single-operation addition and subtraction of exponents, which are important operations, while also making comparison of SFP numbers slightly more difficult.
  • More bits are allocated to the exponent. This increases dynamic range to a level suitable for general-purpose use.
  • IEEE has special values set aside for codes (such as infinity and NaN). In SFP, out-of-domain conditions "bump" intuitively into the extrema of the SFP type instead. Or, if this is not possible, behavior is undefined.

There are other differences, mostly involving the scope of each type and the allowability of subsets. IEEE mandates the existence of printing and conversion functions, and specifies some strict standards for these. IEEE floating point also requires more mandatory library functions, including a more general "power" function. Interestingly, IEEE floating point also does not mandate a rounding strategy for what it terms "library" functions (versus core operations), and this includes the logarithm function. SFP uses the same rounding strategy for logarithms as for the other functions.

Other aspects of IEEE are adopted without major change:

  • The normalization strategy is the same; the entire value apart from the factor 2^E lies between 1 and 2 (or -1 and -2). Since the exponent's base is 2, a given real number can be represented using only one possible exponent.
  • The representations of the significand and sign bit are identical in format to IEEE.
  • The IEEE "round-to-nearest" or "RN" strategy is used. (Round-to-even is not enforced for ties, a fact which has no impact on accuracy or precision.)
  • Selection of a radix of 2 is shared with IEEE and all other modern float formats.

Ultimately, it is hoped that what is presented below offers a novel perspective on what might be called "low-budget calculating", although the limitations of the type must admittedly rule out many applications. One final point in favor of SFP is that it has been exhaustively tested using brute force comparison of results, for all possible parameters, to results obtained using GCC floating point. These efforts are documented further below; suffice it to say here that floating point arithmetic represents a problem domain in which high-profile problems have been the rule, and testing itself has been a challenge (e.g., [5] and [6]). The original version of the Intel Pentium is a prominent example; also, the unsuccessful IBM RT PC was "scandalized mid-life with the discovery of a bug in the floating point square root routine". Every effort has been made to avoid such defects in the construction of the SFP library.

The C Implementation

The file "sfp.cpp," present in the root folder of the source code archive, contains the C implementation of SFP. This was developed using the GNU C++ compiler (GCC; more specifically, MinGW), and also compiles without warning in Visual Studio 2010. This C program is intended as a demonstration and starting point. Portions of "sfp.cpp" devoted to testing the library use C++ features. The implementation proper uses only features of the C language.

In unedited form, this program is of little use; GCC already contains a full complement of built-in floating point types. The real utility of SFP lies in its portability to varied hardware. It is the intent of the author that selected subsets of "sfp.cpp" be excised and compiled (by hand, if necessary) for a variety of 8-bit devices. Nevertheless, "sfp.cpp" serves as the central locus for the definition and testing of SFP, and this program is explored in detail before any of the specific implementations.

The build command used during development is shown below this paragraph. This will build the file "sfp.exe". Because MinGW was used during development, on a Windows system, the full path to the compiler is specified; on a UNIX system with better access to the GNU compilers, this may not be necessary, i.e., the command can start with "c++", omitting the folder path and extension.

c:/mingw/bin/c++.exe sfp.cpp -osfp.exe

This should be executed from the root folder of the source code download, i.e., from the folder where "sfp.cpp" is resident. A command shell (UNIX) or Command Prompt window (Windows) should be used.

When run, "sfp.exe" enters a comprehensive test loop at the top of main(). This is designed to test its results for accuracy against the compiler's floating point system. Such a test will typically span several days, and a /reverse command-line option is provided to help distribute the load of testing. Specifically, it is possible to execute "sfp.exe" on one computer, and simultaneously begin the command "sfp.exe /reverse" on another, and the computers will attack the problem of comprehensive testing from opposite ends of the test sequence. This technique was used during the final testing of SFP, to ensure not just 100% coverage of code, but also 100% coverage of all possible parameter values. If the program completes without halting to display the text "EXCEPTION!", then the comprehensive test has succeeded.

The rest of "sfp.cpp" other than the test code consists mainly of static methods implementing the supported operations. In general, these have a signature somewhat like this:

typedef unsigned char byte;
void sfp_sub(
    byte min, char ein,
    byte min2, char ein2,
    byte *mout, char *eout
);

Parameters min and ein are the input significand and exponent bytes for the left-hand operand, respectively. Similarly, the parameters min2 and ein2 are the input significand and exponent bytes for the right-hand operand. Pointers mout and eout point to the output significand and exponent bytes, respectively.

The supported operations are addition, subtraction, multiplication, division, exponentiation (2^X), and logarithm (base 2). The names used for these in the C implementation are, in order, sfp_add, sfp_sub, sfp_mul, sfp_div, sfp_pow, and sfp_log.

"Primitive C"

The library functions in sfp.cpp are written using a subset of C, which has been designated "Primitive C". This special dialect is designed to record the operations performed by the library code at the byte level, assuming only the most basic capabilities (8-bit addition, subtraction, shift, and bitwise operations).

Primitive C lacks control structures, and types of greater than 8 bits, except for their brief use in overflow checks. In short, it does not allow anything that might be absent from a hypothetical 8-bit system, or just difficult to hand-compile. The only multiplication and division allowed are multiplication and division by 2, i.e., single-bit shifts; this reflects the capabilities offered by the most rudimentary processors, such as the PICs used here. On systems without high level language capabilities, Primitive C is designed to serve as a sort of generic quasi-assembly language, and to be quickly translatable into real code.

The use of this special subset of C seems likely to provoke criticism. It must be admitted that the result is somewhat ugly. Beyond the issue of aesthetics, the idea of taking a subset of a language and artificially constraining one's development activities to it seems problematic. Such a decision invites one to ask why the entire language was not used.

The author's initial experience with such a language subset was in a class that used the work of Wakerly [4] as a text. Wakerly uses a primitive subset of Pascal to model the Motorola (now Freescale) 68000 processor (pages 156-160). Like Wakerly's Pascal, Primitive C attempts to describe algorithms of a certain sort using a nomenclature that will be familiar to programmers.

A distinction must be drawn between writing C that follows strict portability standards, and writing something like Primitive C or Wakerly's special Pascal. Portable C resides at a higher level of abstraction than does Primitive C. Consider, for example, that the work described here does assume an 8-bit signed char type. The possibility of building a type that would fit into a C short, whatever that might be on a given platform, was considered, but rejected. What was deemed necessary here was an unambiguous 8-bit library, targeted at a specific sort of device. The possibility of even further abstraction was thus set aside.

An example of a sequence of Primitive C statements is shown below:

 //Handle log(1)
 if(*mout) goto nzmdss;
 if(util) goto nzmdss;
 *mout=128;
 *eout=-128;

A number of salient features of Primitive C can be discerned from this fragment. To summarize, it sets the 8-bit parameter *mout to 128, and sets the 8-bit parameter *eout to -128, unless either *mout or util is non-zero.

First, it should be noted that the Primitive C code does not spell out how parameters should be passed, how values should be returned, or even how each function's local variables should be allocated. In the fragment shown above, *mout ("mantissa out") is starred because it is a pointer to a location where part of the return value is stored. In contrast, many intermediate values are stored in simple variables which can be accessed without indirection, and their names thus appear in the same context without a star, e.g., util.

This syntactic difference, though, is not necessarily evident at any given location in the assembly language code. In the very simplest implementations, all of these locations tie to hard-coded memory locations, and no equivalent syntactic difference is evident in the object code. Even in an implementation where *mout and *eout do end up getting returned on a call stack, these return values may still need to be manipulated in intermediate locations for most of an operation. In such cases, no difference between the use of identifiers that are starred in the C implementation and those that are not will be evident in the eventual assembly language code.

As hinted by the fragment, variables are 8-bit unsigned values (bytes), as are most parameters. Parameters explicitly dedicated to exponents (e.g., *eout above) use a standard (i.e., signed) char. Basically, nothing with a curly brace, other than the function declaration itself, is allowed. Note, for instance, that it would be much easier to express the last fragment as shown below, except that this would not conform to Primitive C as described:

//Handle log(1)
if(!*mout && !util)
{
 *mout=128;
 *eout=-128;
}

All non-parameter variables have type byte, necessitating typecast operations when dealing with signed data, e.g.:

 //Handle overflow condition
 if(((char)eout_tmp)!=-128) goto cnc45;
 if(!b128) goto cnc46;
 //Return +/- max. SFP value.
 if(neg==neg2) goto caaa0;
 *mout|=128;

Here, eout_tmp is holding signed data, and this treatment is made explicit by the typecast in the first statement. A few other key aspects of Primitive C are in evidence as well. The statement *mout|=128; sets the top bit of an unsigned variable; this evaluates to a single line of assembly language in simple fashion. Finally, the pattern used for program flow control is once again seen above. In general, it is not acceptable in Primitive C to follow an if with anything other than a goto statement.

All of the architectures to which SFP has been ported thus far make use of at least two basic registers: a condition code register (and, in particular, a Carry bit) and an accumulator. In designing Primitive C, decisions had to be made about what sort of abstraction to construct around these commonalities.

In the case of the accumulator, it was decided that no specific Primitive C declaration or variable ought to be created. Rather, it was determined that the availability of the accumulator should be assumed by the Primitive C statements. This liberalizes the definition of Primitive C, and avoids having to deal with an accumulator variable. Consider, for example, the following snippet:

if(ein==ein2) goto eein2outw;

The equivalent PIC assembly code is:

movfw ein1       ;EIN1 in ACCUMULATOR
xorwf ein2,w     ;XOR ACCUMULATOR and EIN2
btfsc STATUS,Z   ;IF ZERO...
goto eein2outw   ;...BRANCH

Note that the use of the register w is only implied in the Primitive C code.

This last fragment also exhibits a peculiarity of the PIC assembly language: all conditional operations are based on conditional skips, not on conditional branches. The opcode btfsc stands for "bit test [static RAM] file and skip [next instruction] if clear". In the last snippet shown, for example, the intent is to branch if the Zero bit of the status register is set. So, a goto is placed below the btfsc instruction. The goto gets skipped if the Zero bit is clear, and (importantly) executed if it is set.

In general, anything that follows the broad if/goto pattern can be a valid Primitive C conditional statement, and it is assumed that the accumulator register will be used as an intermediary to evaluate the necessary conditions. Another example of how the accumulator is used implicitly is shown below:

fullmin=min+128;
The equivalent PIC assembly code is:

movfw min      ;MIN IN ACCUMULATOR
addlw .128     ;ADD 128 TO ACCUMULATOR
movwf fullmin  ;STORE IN FULLMIN

This simple treatment of the accumulator abstracts over the specifics of the register, without adding any great difficulty to the task of translating Primitive C to assembly. In the case of the Carry flag, however, it did prove useful to provide an actual variable. Consider the following example:

 //Shift quotient_lo left, carrying up into min
 quotient_lo*=2;
 if(!c1) goto qu5e;
 ++min;

In summary, quotient_lo is multiplied by 2, and any carry out of the most significant bit is handled by incrementing the variable min. This evaluates, in the PIC implementation, into the following assembly language block:

;Shift quotient_lo left, carrying up into min
rlf quotient_lo,f
btfsc STATUS,C
incf min,f

In the assembly language implementation, it is necessary to clear the Carry bit before executing the left shift (rlf). The contents of this bit are placed into the bottommost bit of the result of any left shift on both the PIC and 6800 platforms. After that, note that the incrementing of min is predicated on the CPU's Carry bit, not upon any actual variable named c1. Correspondingly, in sfp.cpp, the variable c1 is declared in a special "Register Variables" section, and is not counted toward the 15-byte peak memory utilization quoted for SFP. Nevertheless, it is necessary to use something like an automatic variable (c1) to make the purpose of Primitive C clear.

Many assembly language arithmetic routines rely on the interplay of the Carry bit and shift mnemonics such as rlf. Consider, for example, the following snippet, taken from the PIC multiplication routine:

;Shift hi_byte:make_mout
rrf hi_byte,f
rrf make_mout,f

In the example above, the bottom bit right-shifted out of hi_byte in the first rrf operation gets placed into the top of make_mout after the second such operation. This happens naturally; rrf constructs the top bit of its result from the Carry bit, and correspondingly shifts the bottommost bit of its operand into the Carry bit. The Primitive C present in sfp.cpp spells this behavior out exactly, using the register variable c1 as the proxy for the Carry bit:

 // Shift hi_byte:make_mout
 hi_byte/=2;
 make_mout/=2;
 if(!c1) goto chri8;
 make_mout|=128;

This is a very typical pattern on 8-bit devices. Such shifting behavior provides a foundation for performing calculations involving more than 8 bits.
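The two-register shift pattern can be mirrored in plain C. The sketch below models the back-to-back rrf pair using an explicit carry variable; the function name is hypothetical, chosen for illustration:

```c
/* Right-shift a 16-bit quantity held in two bytes (hi:lo), mirroring the
   rrf/rrf pattern: the bit shifted out of hi lands in the top bit of lo,
   by way of a carry variable. */
void shift16_right(unsigned char *hi, unsigned char *lo)
{
    unsigned char c1 = *hi & 1;  /* bit about to be shifted out of hi */
    *hi >>= 1;
    *lo >>= 1;
    if (c1)
        *lo |= 0x80;             /* carry drops into the top of lo */
}
```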

Memory Allocation

Variables such as util and make_mout have been taken mostly as givens in the examples above, with no attention devoted to their allocation. In fact, this is an area where some leeway must be granted to the author of any one implementation of SFP.

These issues are handled in the simplest, cheapest way in the supplied C code. That is, it is assumed that no more than one instance of any SFP function will be running at any given time, at least not in the same allocation space. So, all storage ultimately resides in a series of global variables shared by all functions. These have names like loc000, loc001, etc., all the way through loc00E. It is the size of this suite of variables that establishes the "peak storage requirement" quoted for SFP above, and storage requirements are indeed minimized by such an approach.

However, it would hardly be convenient to write an entire floating point library using variables with names like loc00E, especially in the reduced subset of the C language in use here. So, prior to each SFP operation, blocks like the one shown below are present:

/*Define LOG16 mem. map*/
#define arg1 loc000
#define karg1 loc001
#define util loc002
#define exam loc003
#define neg loc004
#define rev loc005
#define moutL loc006
#define moutH loc007

Below each operation, these definitions are undone using #undef.
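Such a teardown block might look like the following sketch, which mirrors the map shown above (it is not a verbatim excerpt from sfp.cpp):

```c
/*Undo LOG16 mem. map*/
#undef arg1
#undef karg1
#undef util
#undef exam
#undef neg
#undef rev
#undef moutL
#undef moutH
```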

This default memory allocation setup is undeniably primitive. In many ways (its use of global accessibility, its use of static allocation), this setup exhibits exactly how one should not allocate memory. However, these decisions are only defaults, and will actually make sense for many applications.

One of the target platforms for this work, for example, was the Heathkit ET-3400. This is a system with just 256 bytes of RAM, for both program and data. On such a device, there is no real possibility of allocating local variables from a call stack, or even parameters. Static memory locations are simply selected, and the programmer must count himself fortunate to be able to execute floating point routines at all.

On other systems, more expansive design visions can be implemented. The PIC implementation provided with the article, for instance, does use a true call stack for parameters and return values. It also uses global variables in conjunction with #define for util, make_mout, and the like; on systems requiring reentrancy, util, make_mout, etc., should be allocated as automatic (i.e., stack) variables, so that dueling threads-of-execution do not use shared storage in ways that conflict.


Multiplication and Division

The arithmetic operators provided in the SFP library are based on the work of Wakerly [4]. These algorithms require 16-bit multiplication and division, and implementing these optimally requires some work. In all cases, such calculations are performed using shift operations, not repeated addition or subtraction.

The shift-based 16-bit multiplication used by sfp_mul() was taken from a computer engineering syllabus placed online by Rensselaer Polytechnic Institute. The analogous division routine used within sfp_div() is an example of Non-Restoring Division. The implementation was actually based on one of several Algorithm[s] for Hardware Division placed online by Harvey Mudd College, and is also somewhat similar to the SRT Division Algorithm implemented in silicon in several later (i.e., post-bug) Pentium processors.
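The flavor of such shift-based arithmetic can be conveyed in a few lines of C. The sketch below is a textbook shift-and-add 8x8-to-16-bit multiply, not the library's actual routine (which, being Primitive C, would use gotos rather than a loop):

```c
/* Shift-and-add multiply: 8-bit a times 8-bit b, 16-bit product.
   Uses only shifts and additions, as on hardware with no multiply
   instruction. */
unsigned int shift_mul(unsigned char a, unsigned char b)
{
    unsigned int acc = 0;      /* running product */
    unsigned int addend = a;   /* a * 2^i at step i */
    int i;
    for (i = 0; i < 8; i++) {
        if (b & 1)
            acc += addend;     /* include this power-of-2 multiple */
        addend <<= 1;          /* addend *= 2 */
        b >>= 1;               /* next multiplier bit */
    }
    return acc;
}
```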

In both cases, the shifting techniques described in the "Primitive C" section above provided the underpinnings for much of what these sources describe at a higher level. Achieving properly rounded results was largely a matter of maintaining lower byte storage areas to hold values shifted out of a register during such operations. The top bit(s) of these were used to guide the rounding process.


Logarithms

The algorithm used by SFP to calculate radix 2 logarithms is original to the library. The task of taking an SFP number and finding its logarithm is eased by the fact that the second, exponent byte of the number can serve as an estimate of the result. Consider once more the formula for the value of an SFP number having significand byte M and exponent E:

    ((M' + 128) / 128) * 2^E
Because SFP numbers are normalized, there is no overlap between representations, and the left-hand factor of this expression,

    (M' + 128) / 128

must lie between 1 and 2, and the resultant SFP number must lie between plus or minus 2^E and plus or minus 2^(E+1).

The base 2 logarithm of the number will thus lie between E and E+1, and an SFP approximation of this number can be obtained using the normalization process.

If the parameter's exponent is 3, for example, then the correct logarithm will lie between 3 and 4, and an initial approximation can be constructed as 3, or (3/128)*2^7. All normalized SFP numbers consist of a factor 2^X times a factor between 1 and 2. Because of this, the value 3/128 in this estimate must be shifted left 6 times (i.e., multiplied by 64) and its exponent decremented by 6 in compensation. This yields a representation for this estimate of (192/128)*2^1, or M=64, E=1.
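The normalization step described above can be sketched in C; int_to_sfp is a hypothetical helper name, not part of the library:

```c
/* Normalize a small positive integer v (1..255) into SFP form: find M
   and E such that ((M + 128) / 128) * 2^E == v, with 0 <= M <= 127.
   Starting from v = (v/128) * 2^7, shift left and decrement the
   exponent until the implicit top bit is in place. */
void int_to_sfp(unsigned char v, unsigned char *m, signed char *e)
{
    unsigned char sig = v;
    signed char exp = 7;
    while (!(sig & 0x80)) {   /* top bit not yet set */
        sig <<= 1;            /* sig *= 2 ... */
        exp--;                /* ... compensated by the exponent */
    }
    *m = sig & 0x7F;          /* strip the implicit bit for storage */
    *e = exp;
}
```

For v=3 this yields M=64, E=1, matching the worked estimate above.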

Further detail is obtained by using the parameter's significand byte M. In the C implementation provided, this is handled using a lookup table. This table is indexed by the parameter significand byte, modulo 128, and the result is added to the aforementioned approximation before normalization. The values in the table are approximations of:

    log2((x + 128) / 128)

In the formula above, x is the index value. Because the original SFP parameter was expressed as a product of two factors, and the logarithm of a product is the sum of the logarithms of its factors - this is the main point of logarithms - adding such a value directly into the end result is mathematically justified.

The provided logarithm function can be extended to calculate logarithms of any base. Euler's number "e" and 10 are examples of other bases that are in common use. To calculate log_b(X), it is necessary to divide log2(X) by log2(b), where "b" is the base other than 2. Note that the denominator here is a constant, meaning that the calculation of logarithms of arbitrary base requires only an extra division operation, not an additional call into the logarithm function itself.


Exponentiation

The overall translation realized by the SFP exponentiation function is to take an SFP number having value V and return 2^V. For an SFP number having significand byte M and exponent byte E, the strategy used is to return an SFP number whose exponent is derived from the full significand value, (M mod 128) + 128 (the stored bits plus the implicit top bit). This value is then shifted based on E.

Consider some easy examples. If M=0 and E=7, no shifting is required. The factor formed by the significand of 128 (M plus the 128 accounted for by the "hidden" implicit top bit) is the correct exponent of the answer; in other words:

    2^((128/128) * 2^7) = 2^128
(These numbers are outside the domain of the type, so the maximum type value is returned.)

Similarly, if M=0 and E=6, a single right shift (division by 2) applied to the significand yields the correct result:

    2^((128/128) * 2^6) = 2^64
The number of shifts applied to the output exponent is 7-E, where E is the input exponent, e.g.:

    E=5: 128 >> 2 = 32, giving 2^32
    E=4: 128 >> 3 = 16, giving 2^16

and so on.
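The shift count of 7-E can be demonstrated directly. The helper below is a hypothetical illustration that computes only the integer part of the result's exponent, ignoring rounding and the bits shifted out:

```c
/* For an SFP input with significand byte m (sign clear) and exponent e,
   0 <= e <= 7, the integer part of the result's exponent is the full
   significand (m + 128) shifted right (7 - e) times. */
int pow2_exponent(unsigned char m, signed char e)
{
    return ((m & 0x7F) + 128) >> (7 - e);
}
```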

Additional detail is provided using another lookup table, this one having 256 entries. The index into this table is constructed by taking the bits shifted out of the result's exponent byte in performing the divisions shown above. The component obtained from the lookup table accounts for the fractional portion of the division's result. For instance:

    2^0.5 = 2^(64/128) = ((L + 128) / 128) * 2^0, where L = table[128]

In the progression shown above, L is the value obtained from the lookup table. In essence, it ends up determining the value of the result's significand, and the formula for each entry in the table is (2^(i/256) - 1) * 128, where i is the index, i.e., the offset from the table start.

In all these manipulations, the question of domain looms large. Passing 127.5 or greater into sfp_pow() returns the maximum possible SFP value, for example, and in the implementations provided, several such situations are checked for near the top of the function.


Rounding

The simplest floating point algorithms handle the issue of rounding using truncation. This was deemed inadequate for SFP; the precision of the type is already low, and simply abandoning meaningful data would only worsen things.

Instead, it was decided that each SFP operation should return the SFP number closest to the theoretical result of the requested operation. This theoretical result must be calculated using the exact value:

(M + 128)/128 * 2^E, taken with the appropriate sign,

for each SFP parameter, not using any pre-approximation value; other than that, this "round-to-nearest" strategy is easy to describe, and easy to test for (given an existing mathematics library of sufficient precision and accuracy).

However, rounding results to the nearest SFP number actually presents some difficulties. At each shift within each operation, for example, the possibility of data loss exists. One approach seen at several places in SFP is to shift bits down into a bit- or byte-sized memory location, in cases where detail would otherwise be abandoned. The exponentiation function's use of this strategy was already discussed, and in several places where floating point values must be normalized, this tactic repeats itself. Many processor architectures facilitate such operations by shifting into, and out of, the machine Carry bit. This is true of both the PIC and 6800 families.

Where lookup tables are used, an additional issue is the interplay between the code's rounding and the inherent rounding applied to the table values, if any. If the least significant bit of the table value was rounded up, one might ask, then how should this affect rounding inside the function itself?

The collective implications of all such situations are difficult to fathom, and the problems they create are only made worse by the relative scarcity of prior work involving 16-bit floating point implementations. (32-bit rounding techniques do not necessarily scale down well; lookup tables, at least of the sort described here, are probably impractical in size, for example.)

So, the approaches selected are necessarily ad hoc in style. In all cases, though, the design process was informed by an examination of the tradeoffs involved. Such tradeoffs had to be made between, for example, adding additional lookup data versus retaining more bits during calculation or hard-coding corner cases.

In the logarithm function, for example, a 128-entry lookup table indexed by (M mod 128) is used. Initial attempts to achieve proper rounding using 128 8-bit table entries resulted in many rounding errors. Adding another 128-entry lookup table essentially fixed these errors. (The code still contains a single "corner case" where hard-coded handling is used.)

Because these tables are 128 entries in size, 256 words of lookup data (and one conditional) are required. The values in the lookup table used by sfp_log() are expressed in increments of 2^-16, i.e., in 65,536ths. Initially, the table values were rounded, i.e., the last bit of the number in the table was selected such that the closest possible approximation was yielded. In some cases, later rounding operations inside of sfp_log() interacted with this rounding in undesirable ways, and the table values were tuned by hand to force correct results.

The power operation presents different tradeoffs. The lookup table in this case happens to be 256 entries long, since it is indexed by a full unsigned byte. This means that 8 bits of lookup data per entry by themselves take up 256 words of storage. This number is considerable; it exceeds what can be allocated on the base Heathkit ET-3400, for instance, even before one considers the code itself and the 15-byte peak storage requirement. (On the PICs, lookup tables can be stored in the somewhat larger program storage area.)

So, the efficiency of adding additional lookup data (giving a minimum total of 512 words, on an 8-bit device) was weighed against the possibility of adding corner case logic, and it was found that the latter option was superior. Only 71 such cases need to be handled, and this can be accomplished using fewer than 256 words, even on an 8-bit PIC. Like the logarithm implementation, this implementation seemed to occupy what might be called an architectural "sweet spot."

Exceptional Conditions

In the default implementation, the domain of all functions encompasses all representable numbers, and correct results will be returned in all cases. The main caveat to this specification is that taking the logarithm of a negative number will result in undefined behavior.

Other than that, the closest SFP number to the answer will be returned. If the answer is too big to represent, then the largest representable number will be returned, with the appropriate sign. Note that this behavior includes operations that would return infinity under IEEE-754. Similarly, if the number is too small to represent, the smallest representable number will be returned, again with the appropriate sign.

SFP does not have a distinct zero; the smallest possible values (where M is in {0, -128} and E=-128) serve as proxies for zero where this makes sense. The logarithm of 1 is calculated as M=0, E=-128, for example. Like IEEE-754, the SFP number system encompasses both positive and negative zeros (since this smallest possible value can be represented with a sign bit of 1 or 0).

Division by zero returns plus or minus infinity under IEEE 754, and the same is true of SFP. While SFP does not have a distinct infinity value, plus or minus MAXREAL, where MAXREAL is the largest possible SFP value, stands in for plus or minus infinity. The rules for determining the sign of the infinite result are the same as under IEEE-754.

None of these SFP design decisions are meant to imply that detection of exceptional conditions is unnecessary for a given application, and any production application using floating point ought to rest on a thorough understanding of the type used. Developers of SFP applications, in particular, must be cognizant of the way that the limited precision of the library affects addition and subtraction; e.g., adding a small quantity to a large one may produce no effect. Such considerations must be understood and handled using means outside of SFP.

At the same time, though, this is not so different from the burden placed on any programmer using floating point. The issue of adding very small values to very large ones, without effect, is inherent to floating point. More generally, exceptions are a means of communicating situations back to the application programmer, but they do not remove the underlying considerations of domain that accompany binary floating point.

The design of SFP acknowledges the fact that full IEEE-754 exception handling is impractical on many devices. This changes the exact nature of the interface between SFP and its application programs, without setting SFP apart from other floating point libraries in any fundamental sense.

SFP Subsets

Two subsets are defined by sfp.cpp using the C preprocessor:

  • If FASTMATH is defined, then sfp_pow() and sfp_log() do not apply any special handling to corner cases, and rounding errors are therefore possible. These are single-bit errors, although it must be remembered that sfp_pow() will invoke sfp_div() in some cases, further decreasing accuracy.
  • If BOUNDSCHECK is not defined, exponent overflow is never checked for. This applies to the overall operation, as well as to some intermediate calculations that might exceed the range of the system even if the eventual result would be representable. Nevertheless, it is still allowable to use some numbers of very large/small magnitude even without bounds checking. Conservatively, it can be said that operations whose correct results have exponents ranging from 2^-126 to 2^125, inclusive, should at an absolute minimum return correct results even without bounds checking. Importantly, multiplication of values by zero, and addition of values to zero, still behave as expected. These operations, and all of the intermediate calculations they involve, have correct results that lie within the SFP domain. Division by zero, on the other hand, results in undefined behavior if BOUNDSCHECK is not defined (as opposed to the full SFP implementation, in which such operations return the maximum possible value). Similarly, division operations with a zero in the numerator cannot be trusted if BOUNDSCHECK is not defined. Such operations take the smallest possible SFP quantity and attempt to reduce it in size, creating an out-of-bounds condition which this subset of the library will not detect.

Both of these options reduce the size of the resultant library code considerably.

In addition, it should be realized that most of the SFP operations are independent of the other operations. The logarithm function can be taken by itself and implemented in a system where a logarithmic scale needs to be applied, for instance, without any need to implement the other SFP operations. Subtraction does execute a goto into the addition routine, and negative exponentiation uses a goto into the division routine to invert the final result.

The PIC Implementation

The PIC implementation supplied with this article is present in the pic subfolder of the root folder of the source code Zip file. The PIC code provided includes addition, multiplication, division, logarithm, and exponentiation functions. These are based on the Primitive C code with BOUNDSCHECK undefined, and FASTMATH defined. The functions comprising this library are named addf, mulf, divf, logf, and powf.

A demo program is also provided, which calculates 34 factorial and prints it repeatedly to the PIC serial port (pin TX). This value is very close to the top of the range of the SFP type, and beyond the range of the two smaller IEEE types. The demo program uses serial communication techniques already described in a previous article. The 16F690 demo outputs its calculated value (34!) at 57,600 baud. The 16F1827 demo transmits at 115,200 baud. Both programs use 8-bit bytes, with no parity and one stop bit.

This demonstration, and the SFP implementation, are written in Microchip assembly language and can be built using the MPASM assembler together with the MPLINK linker. Typically, the entire MPLAB development environment should be installed; the author used version 8.60. The MPLAB suite also includes a simulator, which can be used to run the SFP demo without acquiring any PIC hardware at all. MPLAB is currently available without any up-front fee from Microchip Technology.

The arithmetic operator functions are stored in files named sfpaddf.asm, sfpmulf.asm, and sfpdivf.asm. The logarithm is stored in sfplogf.asm, but the exponentiation function shares sfpdivf.asm with the division function upon which it depends. These files are independent of each other. It is not necessary to assemble, link, or otherwise include a file unless an application specifically requires an operator in that file.

A print function is provided to display SFP numbers in base 10, along with comparison and conversion functions. Most of these functions are stored in the common file sfp.asm, which is mandatory for all applications. All of the files mentioned thus far, i.e., all of the SFP library files, are stored in a common folder, which is shared by all targeted PIC processors.

Ultimately, whichever functions are necessary for a given application are each assembled individually, along with the application's own files, using MPAsm.EXE. Then, these must be linked together with MPLink.EXE. This is demonstrated by make.bat, the build script for the PIC SFP large factorial demo. Another batch file, clean.bat, is provided to remove all of the build process' outputs. (It should be noted that all .bat files provided are intended only for the Windows "Command Prompt.")

The print function (printf) is rudimentary; rather than deal with issues of formatting, it outputs SFP numbers in a fixed-size format similar to scientific notation, e.g.:


The caret character '^' stands in for exponentiation. The numbers shown above are -3 (top) and 0.5.

The conversion functions are utof, which creates an SFP number from an 8-bit unsigned value, and ftou, which converts an SFP number into an 8-bit unsigned value. These reside in the source files sfputof.asm and sfpftou.asm. The debugging function dbgpkf is also provided (in sfp.asm). This function prints the float stored atop the parameter stack, but leaves it in place. It is thus designed to be inserted directly into problematic code without disrupting its operation.

The PIC 16F690 and 16F1827 are targeted. These two devices typify the last two generations of PIC 8-bit microcontrollers. For example, Microchip Technology sells a demo board and/or starter kit based around each of these exact devices. In general, members of the last two major generations of 8-bit PIC microcontrollers are supported by the resultant code base, with only trivial changes.

Each device-specific subfolder of the pic folder contains its own version of make.bat, which is used to build target.asm, along with all of the SFP assembly files, the stack implementations, etc., into a HEX file. Note that make.bat assumes that MPASM and MPLINK are installed in their default locations, i.e., under C:\Program Files\.

The HEX file can be programmed onto a PIC device using any recent Microchip Technology programmer. During development, the author used a PICKit 2 programmer, with OS firmware 2.32.00 installed. Version 2.61.00 of the PICKit 2 programmer utility was used. After programming, and after any reset or loss of power, the PIC will then begin emitting the value of 34!, which is the largest factorial that can be expressed using SFP. More details about programming, and about setting up serial communication, are available in the previously mentioned article.

The exact text transmitted repeatedly on the PIC serial port is:


This evaluates to 2.9908e+38. The actual value of 34! is 2.9523e+38. The error is thus 1.3031%, which must be weighed against the fact that 32 multiplications are performed to obtain this result. This yields a per-calculation error of just 0.040722%. This number testifies to the level of precision and accuracy attainable using SFP, despite its relatively low (2.4 digit) nominal precision. Careful attention paid to rounding issues during the development of SFP contributes to this performance.

High-Level Structures

The PIC implementation uses dual software stacks (in addition to the small, hardware-resident return address stack). The first of these (i.e., the stack indexed using FSR on the PIC 16F690, or FSR0 on the PIC 16F1827) is used to pass SFP parameters and return values, and to effect the calculation of factorials made by the supplied demo program. Macros PUSH and POP assist with these operations; push operations take the value to be pushed from the accumulator w, and pop operations are executed into w. SFP values are passed by pushing the significand byte followed by the exponent byte. For binary operators, the first SFP number pushed is analogous to the left-hand operand, e.g., the numerator for division.

The code for these stacks, and for a few other items of infrastructural code, is specific to the generation of the PIC device being targeted. The PIC 16F1827 has two index registers, for instance, whereas index register swapping is (somewhat painstakingly) implemented in software on the 16F690. As such, these portions of the code reside in two device-specific subfolders, pic/pic16f690 and pic/pic16f1827. The file containing the demonstration program's entry point is also device-specific, and distinct versions of this file (target.asm) reside in these same device-specific subfolders.

Note that much of the code at the top of target.asm is boilerplate UART setup code; more information about this code is available in the previously mentioned serial communication article. Here, let it suffice to say that the PIC implementation emits its output on pin TX, at 57,600 baud, with 8 bits per byte, no parity, and one stop bit.

One major obstacle to the calculation of large factorials on the 8-bit PIC is its small, built-in stack. The PIC call and return facilities rely on a hardware stack of limited depth. In the 16F690 (prior generation 8-bit PIC), this stack stores return addresses for just 8 function calls, with no provision for detecting overflow. This means that one obvious implementation of factorial (exemplified by the Scheme function below this paragraph) will quickly break down, and this is true regardless of the data type used for the numbers involved:

(define (fact n)
 (if (= n 0)
     1
     (* n (fact (- n 1)))))

This Scheme code is not tail recursive, and thus requires a return address to be placed onto the stack for each multiplicand of the factorial (e.g., 20, 19, 18, 17... etc. down to 2 for 20!). This is not feasible on the minimalist hardware targeted by the work presented here.

As an alternative, consider the C++ factorial program shown below. This program is very similar in construction to the supplied PIC demonstration programs:

#include <cstdio>
#include <stack>
#include <vector>

using std::stack;
using std::vector;

stack<float, vector<float> > s;

void longfact();

int main(int argc, char *argv[])
{
 longfact(); //Endless loop
 return 0;
}

void build();  //Build list like 0.0, 5.0, 4.0, 3.0, 2.0
void mulstr(); //...invoke stack multiply on list until 0.0

void fact()
{
 build(); mulstr();
}

void longfact()
{
 for(;;)
 {
  s.push(0.0f); s.push(34.0f); //Terminator, then 34 for 34!
  fact();
  printf(" %f", s.top()); s.pop();
 }
}

void build()
{
 for(;;)
  if(2.5f < s.top()){s.push(s.top() - 1.0f);}else break;
}

void mulstr()
{
 for(;;)
 {
  float f = s.top(); s.pop();
  if(s.top() < 0.5f){s.pop(); s.push(f); break;} //0.0 terminator reached
  f *= s.top(); s.pop(); s.push(f);
 }
}

This program first builds a list of floating point numbers, something like "0.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0" on stack s, and then performs stack-based multiplication until the 0.0 is found. No recursive function calls are used in this code. This is true of the PIC code in each version of "target.asm" as well. If expressed in Scheme, such an implementation would use only tail recursion.

The code in target.asm implements the same basic algorithm as this C++ code, and thus operates at a high level of abstraction. This code is reminiscent of the object code of a compiler or the threaded code used to execute some high-level languages. The next few snippets of code from target.asm, which execute sequentially, typify the level of abstraction at which the code in this file operates:

movlw .32
PUSH
movlw .1
PUSH

The snippet above places the SFP approximation for 2.5 (part of the C++ factorial algorithm just shown, and approximated as M=32, E=1, or 160/128 * 2^1) onto the stack.

The next sequence takes the first two parameters pushed for the executing function (build - again, this is evident in the C++) and copies them to the top of the stack using the function parm. Before calling parm, the byte index of the parameter requested (e.g., 0 for the topmost parameter) must be pushed onto the main stack:

movlw .1
PUSH
FAR_CALL build,parm
movlw .0
PUSH
FAR_CALL build,parm

Next, the gtf function from sfp.asm is invoked, to compare the two SFP numbers now present at the stack top:

FAR_CALL build,gtf  

The gtf function consumes these two floating point numbers (four bytes) from the stack top, and places a single byte value on the stack in return. This will be non-zero if the first SFP number passed is larger than the second number passed (i.e., than the topmost SFP number), and 0 otherwise. This result value is then used to choose between two conditional branches:

POP                ;Pop gtf's result byte into W
xorlw .0           ;XOR literal 0 with W, result in W, set Z flag based on result
btfsc STATUS,Z
goto hlllb51J5     ;Z set: gtf returned 0
goto hlllb51J6     ;Z clear: gtf returned non-zero

While somewhat primitive compared to the application code typically written for a general-purpose computer, the code shown above does actually operate at a reasonably high level of abstraction. It tracks the C++ factorial program just shown very closely, for example, and what is actually present in target.asm is just a thin layer scripting the actions of SFP, the stack library, the "kernel" .asm files, and a few macros built to support these operations.

One must also consider the possibility of generating such assembly language sequences automatically, as the object code of a high-level language. The second software stack implemented by the supplied PIC code hints at how such a language might implement parameterized functions, for example. Throughout the execution of the various instances of build and mulstr necessary for the calculation of 34!, the second stack is used to hold base pointers, which are utilized by the parm function from the "kernel" assembly file to obtain parameters.

In terms of the C++ factorial implementation shown above, these "base pointers" equate to pointers to the top of s as it stood at the beginning of any given function call. This results in a slightly different idiom for manipulating the central stack. In C++, in the function mulstr, for example, top, pop, and push are combined to effect comparisons involving the second (bottom) of two SFP values stored atop the stack. In target.asm, a pointer placed atop the second stack at the start of each call to mulstr is used to obtain this SFP value instead, and the kernel function parm effects the necessary dereferencing.


Future Directions

This article leaves open many directions for future work. One obvious avenue for exploration is the application of SFP to real problems (or at least more interesting ones). The PID controller mentioned earlier will quite likely grow into an article of its own at some point in the future. Many applications will require additional library functions to be developed.

Another possible topic is a more complete PIC implementation (e.g., with bounds checking). The techniques described here are also designed to be ported to other platforms, including, but not limited to, the 6800 and its many descendants.

During the development of SFP, parallel development of a similar 32-bit type was under way. This type augments the significand byte with two more bytes of data, leaving the significance of the existing SFP bytes completely unchanged. Part of the appeal of this type is that it is trivially interchangeable with the SFP type presented here. If we represent a 32-bit "extended SFP / ESFP" number as M, N, O, and E, where N and O are extra significand bytes, then this number has an SFP approximation of M, E (M can be rounded up based on the top bit of N). Similarly, any existing SFP operation can be trivially turned into a preliminary ESFP implementation, with a known level of inaccuracy.

The level of completion for this 32-bit work is lower than for SFP; exhaustive testing is not practical, nor are the lookup tables used for the SFP exponentiation and logarithm operations practical, at least not without modifications. These obstacles notwithstanding, this 32-bit type may grow into a submission similar to this one.

The tradeoffs presented by a 24-bit type having a single extra byte of significand data seem favorable as well; as one commenter pointed out, there is no particular reason to favor 32 bits over 24 on a true 8-bit device. These more ambitious implementations would most likely need to be tested as well as possible using non-exhaustive means. At least initially, they could rely on the existing 16-bit exponentiation and logarithm routines.

Finally, an expanded treatment of the high-level strategies applied in the PIC large factorial demonstration will eventually be forthcoming. In examining the ways in which others have been able to stand up similar high-level facilities quickly and cheaply, FORTH can be identified (respectfully, and in full recognition of the incompleteness of this author's own treatment) as perhaps the most similar existing effort. This is exemplified by FORTH's stack orientation, its reliance on something resembling threaded code, its lack of formal type checking, and its occasional reliance on a cross-compiler.

A future article will describe how it is possible to take the high-level form seen in target.asm and build a cross-compiler capable of generating such object code from a high-level language. Much of the underlying work necessary for such a cross compiler (and runtime) has already been done, and this treatment encompasses a broad array of PIC devices, with the promise of even further portability. Taken in conjunction with the SFP article at hand, and its predecessor, this future article will provide an overview of what might be called "computing from the ground up." More formally, this series of articles will provide a high-level overview of a fairly abstract computing system, built up quickly and cheaply from the IC level.


References

[1] IEEE. "IEEE Standard for Binary Floating-Point Arithmetic." ACM SIGPLAN Notices 22, 2 (Feb. 1985), 9-25.

[2] M68000 Family Programmer's Reference Manual. Motorola, 1992.

[3] IEEE Standard for Floating-Point Arithmetic (IEEE 754-2008). IEEE, 2008.

[4] Wakerly, J. Microcomputer Architecture and Programming: The 68000 Series. New York: Prentice-Hall, 1989.

[5] Karpinski, R. "Paranoia: A Floating-Point Benchmark." Byte Magazine 10, 2 (Feb. 1985), 223-235.

[6] Monniaux, D. "The Pitfalls of Verifying Floating-Point Computations." ACM Transactions on Programming Languages and Systems 30, 3 (May 2008), article 12.


History

This is the tenth major version of this article.

Compared to the first version, the second version of the article contained improvements to the formatting and clarity of certain passages.

The third version updated the treatment of division-by-zero given in the article text, and corrected additional formatting errors.

The fourth version of the article added more detail about the actual Microchip Technology development tools used. A few small improvements to the explanatory text were also made, for reasons of clarity and grammatical correctness.

The code remained mostly unchanged throughout all of these revisions. The third version of the article did include some code changes related to the handling of division-by-zero, which brought the library into conformance with IEEE-754 in this area. The fourth version included a trivial optimization in the division routine. This optimization did not affect the result returned by the division function.

The fifth version added logarithm and exponentiation functions to the PIC implementation of SFP. The "Primitive C" version of the logarithm function was also altered slightly, to more accurately reflect the author's experience in writing the PIC logarithm function. The overall algorithm remained the same, along with the results returned by the function.

The sixth version corrected a problem with the iszerof function in the PIC implementation. Previously, this function returned 1 (true) for a few very small non-zero numbers. Now, it returns 1 only if passed an SFP zero.

The seventh version corrected another problem found in the PIC implementation, involving the addf function. Certain additions involving operands with two different signs were returning values with an incorrect sign.

The eighth version of the article removed some unreachable code from the PIC addf function.

The ninth version of the article added some more details about the SFP library subsets, and their capabilities. In particular, the limitations present when BOUNDSCHECK is not defined are described more fully.

This tenth version improves the readability of some of the explanatory text. 
