Fast Automatic Differentiation in C++

Eugene Shatokhin

4.67/5 (10 votes)

Sep 1, 2006

CPOL

12 min read

77641

2868

An article on fast automatic differentiation (FAD), another way of computing the derivatives of a function.

Sample Image - fadcpp1.gif

Introduction

Fast automatic differentiation (FAD) is another way of computing the derivatives of a function in addition to the well-known symbolic and finite difference approaches. While not as popular as these two, FAD can complement them very well. It specializes on the differentiation of functions defined merely as fragments of code written in any programming language you prefer.

Here I'd like to outline the benefits of this approach and describe my own implementation of FAD algorithms in C++.

FAD can be helpful in embedding differentiation capabilities in your programs. You don't really need to know how it works to use it effectively. (Still you probably should know what differentiation is and how the derivatives of a function can be used.) This method really shines when you need to differentiate a function defined iteratively or recursively, or if the respective code fragment contains branches etc.

There is an excellent article here on CodeProject on how to implement symbolic differentiation. It also contains some notational conventions and differentiation formulae for useful functions, so I won't repeat it all here.

Here dy/dx denotes a derivative of y with respect to x (a partial derivative if y also depends on arguments other than x).

FAD basics

There are only two things you need to know to understand how the automatic differentiation works. Fortunately they are quite easy to understand. (Although, I repeat, they are not absolutely necessary to be able to use FAD.) The first one is the so called "chain rule" of differentiation.

Suppose we have a composite function y(x) = f(g(x)) and g(x) and f(t) are differentiable at x == x0 and g(x0) respectively. Then y is also differentiable at x == x0 and its derivative is
dy/dx = df/dt * dg/dx,
where dg/dx is computed for x == x0, df/dt for t == g(x0). If in turn g() function is composite (for instance g(x) = h(k(x))) we can apply this rule repeatedly extending the "chain of derivatives" until we get to the functions that are not composite ("elementary functions"):
dy/dx = df/dt * dh/du * dk/dx,
where x == x0, t == g(x0), u == k(x0).
Note that f(), h() and k() functions are "elementary" (like sin, cos, exp etc.), so we already know the exact formulae for their derivatives.

This rule can be generalized for the functions of more than one argument. The formulae are slightly more complex in this case but the principle remains the same.

Now recall that many of the computational processes operating on floating-point values are nothing more than an evaluation of some composite function. For a given set of input data the respective program executes a finite sequence of "elementary" operations like +, -, *, /, sin, cos, tan, exp, log etc. The conclusion is simple and straightforward. If we could process this sequence like a composite function we would be able to compute its derivatives, no matter how long it is. Fast automatic differentiation does exactly that. (See the [1] review for more details. There have been many publications on this subject in recent years. The ideas behind FAD are really not new.)

Several FAD algorithms require this sequence to be stored somehow for further processing, other algorithms don't. One of the most popular data structures to hold this sequence is a "computational graph" (the second thing that is important to understand FAD along with the chain rule described above). In a computational graph (CG) the nodes correspond to the elementary operations while the arcs connect these operations with their arguments. See the figure below for a sample CG for a function z(x) = f(g(h(x), k(x))). ("in" denotes any input operation.)

Sample computational graph

If the code fragment contains branches, loops, recursion etc., CG may be different for different inputs. That's why the CG is not a control flow graph or an abstract syntax tree (AST) or the like.

Automatic differentiation algorithms compute the derivatives of a function at the same point the function is evaluated in (unlike symbolic methods). Once the CG have been recorded one can traverse it either from the arguments to the functions of interest (from x to z in the example above, i.e. "forward") or in the opposite direction ("reverse") to compute the derivatives. Hence there are two main classes of FAD algorithms, namely "forward" and "reverse".

"Forward" FAD computes the derivative of every function with respect to one given argument in one CG traversal. (Although some forward algorithms don't require the CG to be actually stored.)
"Reverse" FAD computes the whole gradient (derivatives with respect to every independent variable) of one given function in one traversal of the CG. This can be beneficial when dealing with a small number of functions of many arguments each. Such functions can arise, for example, in matrix manipulations etc.

The "forward"-like FAD algorithm can also be used to estimate the inaccuracy of the function value accumulated during the computation. As you probably know, floating point values are often stored with the finite relative precision. For example, 1.0 and (1.0 + 1.0e-16) stored as double precision IEEE-compliant floating point numbers are usually considered the same. These "rounding errors can accumulate and lead to some uncertainty in the computed values. Rounding error estimation (REE) algorithms estimate these uncertainties (using the methods similar to the interval calculus).

A C++ Implementation

One common way to implement FAD algorithms in C++ is to use overloaded operators and functions for special "active variable" classes. (There are other ways, of course. Some of these other methods rely heavily on parsing, which is really beyond the scope of this article. One can find some information on these methods in [2].)

Consider the following code fragment:

    double x0, y0;
    double x, y, f;
    // get x0 and y0 from somewhere (ask the user, for example)
    //... 

    x = x0;    // initialize the arguments (x and y)
    y = y0;
    f = sin(y * cos(x));

In order to employ FAD algorithms you need to change the types of the variables of interest (x, y and f) from double to CADouble ("active double") provided by a FAD library. You also need to mark the beginning and the end of an active code section (a fragment that defines the function to be differentiated along with the initialization of its arguments). Then you get the following:

    double x0, y0;
    CADouble x, y, f;    // instead of "double x, y, f;"

    CActiveSection as;

    // mark the beginning of the 'active section'    
    as.begin();    // start recording operations in a computational graph (CG)

        x = x0;    // initialize the arguments (x and y)
        y = y0;
        f = sin(y * cos(x));

    as.end();     // stop recording, the active section ends

These code fragments look quite similar, don't they? In both cases f(x, y) is calculated but in the second one the appropriate sequence of elementary operations (the CG) is also recorded. The CG can be used later to compute the derivatives of f(), like this:

    double out_x, out_y;    // here the derivative values will be stored

    // compute df/dx partial derivative using forward mode FAD
    as.forward(x, f, out_x);

    // compute df/dy partial derivative using forward mode FAD
    as.forward(y, f, out_y);

Useful operators and functions are overloaded for CADouble instances, so that these operations not only do what they are usually meant to but also record themselves in the CG.

That's it. With a little change in your code you get the differentiation capabilities embedded in it. Of course, there is no free lunch. Using FAD can slow down execution of the affected code fragment by a factor of 2 to 400 (or even more, depending on the FAD algorithms you employ, your hardware etc.). Traversing a large CG involves a lot of memory access operations, which can be, of course, slower than simply performing a sequence of mathematical operations.

Note that methods that do not require the CG to be actually stored (the so-called "fast forward" and "fast REE" algorithms) are usually faster and need less memory.

Using the code (the "Reed Library")

The "Reed Library" I have written provides the following:

classes for active variables and active sections;
CG-based FAD methods: forward and reverse (both for computing 1st order derivatives of a double-valued function of one or more double-valued arguments);
CG-based method of rounding error estimation (REE);
implementations of "fast forward" differentiation method and "fast REE" algorithm that do not require the CG;
some utility stuff (CG caching, updating for new argument values without rebuilding the CG if possible, and more).

("Reed" is an acronym for "Rounding Error Estimation and Differentiation". It turned out that REE functionality was mandatory in my work while using FAD seemed to be optional. That's why REE is in the first place in this name. FAD was also very helpful to me more than once though.)

Detailed reference of the classes and methods you can find in the provided Doxygen-generated documentation (see /reed/doc/reed library v01.chm). I'll mention only key points here.

Most of the useful stuff in the library resides in namespace reed.
Currently supported elementary operations are:

arithmetic operations (+, -, *, /), as well as combined operations like +=, -=, *=, /=;
fmin(x, y) and fmax(x, y) - minimum and maximum of two values (x and y);
conditional assignment (condAssign(x, y) == y if x is positive, 0 otherwise);
abs(x) - absolute value of x;
sin(x), cos(x), tan(x);
exp(x), log(x);
sqrt(x).

This set of operations can be extended in the future.

Several of the operations above are not everywhere differentiable (e.g. abs(x) has no derivative at x == 0). When evaluated at these points, some value will still be computed, the system won't crash but, of course, this value cannot be trusted. (Perhaps I'll make a flag or something like this to indicate this condition in future versions. Unfortunately I have no time for this now.)

Relational operations (==, >, < etc.) are also overloaded for CADouble and CADoubleF<...>. In CG-based methods they are used to detect whether CG structure must change for new values of the arguments. In "fast" methods they simply return the result of comparison of the respective floating point values.

Value of an active variable can be accessed via setValue() and getValue() methods.

Instances of reed::CADouble class can be created both within and outside an active section. But (for efficiency reasons) you can assign something to them and use the overloaded operations for them within an active section only. Failure to do this will cause an assertion in debug mode and undefined behaviour in release.

To store CGs I chose to create a special memory allocator for fixed-size objects that is much faster than using the default new and delete. See reed/mem_pool.h. One can find information on building such allocators, for instance, in [3].

All CG-based FAD algorithms are implemented as methods of reed::CActiveSection class. Among these are:

int forward(CADouble arg, CADouble func, double& out);
Computes the derivative of function func with respect to argument arg using forward mode FAD and return the result in out. Return value is insignificant for now.

int reverse(CADouble arg, CADouble func, double& out);
Does the same as above using reverse mode.

int ree_lower(std::vector<CADouble>& arg, std::vector<double>& err);
Computes the lower estimate of an error. By default the arguments are supposed to be stored with a machine accuracy (i.e. with relative precision about 1.1e-16 for an IEEE-compliant normalized double value.) You can override this by setting initial absolute error values for the variables in arg. These values are passed in err vector. After ree_lower() returns, you can retrieve the accumulated error value for any variable via CADouble::getAcc() method, e.g. double e = f.getAcc().

There are other overloaded versions of these methods that can be useful. See documentation and the examples provided with this article for details.

To use these methods you should #include reed/adouble.h and reed/asection.h in your code and link your project with reed/lib/reed.lib library (or reed_d.lib for a debug version). See the provided examples ("simple", "iter" and "gradient").

"Fast" forward and REE algorithms are implemented in reed/adouble_f.h. You don't need any active sections or CG to use these. Nor do you need reed.lib. Just #include reed/adouble_f.h in your project, add util.cpp to it and you are all set. The "iter_f" example shows how you can use these "fast" algorithms.

Specials

As I mentioned above, you can use FAD algorithms (both CG-based and "fast" versions) to differentiate iteratively defined functions, functions with branches etc. provided the functions are differentiable at the point of interest. Consider the following example (see "iter_f" project from the zip archive with the examples for a complete implementation). Suppose y = y(x) is defined like this:

    double x0 = ...; // set the argument value or ask the user for it
    size_t m = ...;     // Number of iterations to perform. Let the user define it too.
    
    // Note: the value of m can be unknown at the compile-time. FAD will handle this.

    double x, y;
    x = x0; //argument
    y = x;
    for (size_t i = 0; i < m; i++)
    {
        if (y < 0.5)
        {
            y = 4.0 * y * (1.0 - y);
        }
        else if (y > 0.5)
        {
            y = 1.0 - 0.75 * y * y;
        }
        else
        {
            y = -1.0;
            std::cout << "1/2 has been hit!\n";    // shouldn't get here
            break;
        }
    }

Need to compute dy/dx at x == x0? No problem. The resulting code fragment uses "fast" forward method, but you can use CG-based methods too. Change types as described above, set initial derivative value (dx/dx == 1.0) and you'll get

    double x0 = ...; // set the argument value or ask the user for it
    size_t m = ...;     // Number of iterations to perform. Let the user define it too.

    CADoubleFF x, y; // for REE use CADoubleRee instead.
    // Active sections are not necessary for "fast forward" and "fast REE"
    // algorithms because these algorithms do not require CG to be stored.

    x = x0; //argument
    x.setAcc(1.0);    // Accumulated derivative is 1.0 in x (dx/dx). By default it is zero.

    y = x;
    for (size_t i = 0; i < m; i++)
    {
        if (y < 0.5)
        {
            y = 4.0 * y * (1.0 - y);
        }
        else if (y > 0.5)
        {
            y = 1.0 - 0.75 * y * y;
        }
        else
        {
            y = -1.0;
            std::cout << "1/2 has been hit!\n";    // shouldn't get here
            break;
        }
    }

The value of the derivative dy/dx will be computed along with the y(x) at x == x0. Then you can retrieve these values via y.getAcc() and y.getValue(), respectively. Pretty easy, isn't it?

Description of the examples

The "FADcpp_examples" Visual Studio solution contains 5 projects. It was tested under MS Visual C++ 7.1 only. (Perhaps it won't work under 6.0 or earlier due to some template tricks I used. I haven't tested it under 6.0 though.)

"FADcpp_examples/reed" directory contains the "Reed Library" itself (code and documentation). You can rebuild the library using "reed_lib" project of this solution.

Other 4 projects are the examples of using the library:

"simple" - a simple example. Computes the derivatives of f = sin(y * cos(x)), x and y supplied by the user. Uses both forward and reverse CG-based algorithms. Demonstrates REE too. Also shows how to reuse the existing CG to compute derivatives at another point (if possible).

"iter". Differentiates the iteratively defined function (see above). The argument and number of iterations are supplied by the user.

"iter_f". Same as "iter" except "fast" forward and REE algorithms are used instead the CG-based ones.

"gradient". Computes the gradient of a quadratic function f(x) = (x * Ax), where A is a 3x3 matrix, x = (x0, x1, x2), using the reverse FAD algorithm. Only a single CG traversal is used in this case. (If the forward algorithm were employed, it would require 3 CG traversals).

Conclusion

This article is an outline of the so-called fast automatic differentiation (FAD). FAD can be of help when you need to embed differentiation capabilities in your program and/or to handle functions with branches, loops recursion etc. This way automatic differentiation can complement symbolic differentiation. The latter is mostly used when we have a formula ("closed expression") for a function to be differentiated. If we don't, FAD can often help. Hope my implementation of the fast automatic differentiation and rounding error estimation algorithms will prove useful to you.

Give it a try! Any feedback will be appreciated.

References and recommended reading

M.Iri. History of automatic differentiation and rounding error estimation. Automatic Differentiation of Algorithms: Theory, Implementation and Application, A.Griewank and G.F.Corliss, eds., SIAM, Philadelphia, PA, 1991.
C.Bischof, M.Buecker. Computing Derivatives of Computer Programs.
D.Bulka, D.Mayhew. Efficient C++ Performance Programming Techniques. Addison Wesley, 1999.

History

v.0.1 The first versions of this article and the examples are available.