Compile Time Loops with C++11 - Creating a Generalized static_for Implementation

Michael Gazonda

Rate me:

4.80/5 (15 votes)

28 Dec 2014CPOL11 min read

46.4K

199

Have you ever been working with templates or constexpr and wanted to run a loop? Or maybe you have a loop you'd like to unroll and see if your program will run faster? Welcome to static_for.

Download static_for.zip - Includes project files for Visual Studio 2013 and Xcode 6

Introduction
Requirements
The Problem
Related Topics
Getting Started - Trying to work with recursion
The recursion problem in detail, and the fix
The good stuff - Writing a generalized static_for implementation
The code
A test program
Code commentary
How can I choose between a compile-time loop, and a run-time loop?
Questions and comments
History

Introduction

I recently came across a problem that I found interesting. How can I easily loop at compile time?

I'd thought about writing a static_for method a few times in the past to simplify some template code, but until now there was no pressing need for it.

I recently found my reason to write this one. I needed to test some compile-time code, and with that came the requirement to loop through it. The built-in tools weren't quite enough, and so... here I am, and here is my static_for!

By looping at compile time, you can interact with compile-time code (such as templates).

static_for will also unroll loops.

Requirements

A compiler that supports C++11 or later. The download contains a workaround is provided for Visual Studio's current lack of constexpr support by using enums, and more templates.

The Problem

Looping at compile time. Looping at run-time is easy:

C++

for (int i = 0; i < upper_bound; ++i)
{
    do_stuff(i);
}

Since we're working at compile time, the solution isn't that simple. I'd like to do:

C++

template<int n> void do_stuff()
{
    // stuff
}
for (int i = 0; i < 1000; i++)
{
    do_stuff<i>();
}

But... this doesn't work. i is a runtime counter, and do_stuff<int>() requires a compile-time counter. Because I don't want to write:

do_stuff<0>
do_stuff<1>
do_stuff<2>
do_stuff<3>
do_stuff<4>
...
do_stuff<997>
do_stuff<998>
do_stuff<999>

And of course, that would still require me to hard-code the number of iterations. I need to loop to a number that is calculated through compile time expressions, and yet have the same run-time performance as hard coding above.

Today I'll walk you through a bit of my process in creating a generalized solution to this problem, as well as show you some of the potential pitfalls and my reasons for needing to create something of this complexity.

Or, you can just download the code, and play around with it :D

Does unrolling a loop increase performance?

Template argument deduction

Template instantiation

Template meta-programming

Generic programming

n-ary/k-ary tree

Getting started - Trying to work with recursion

So let's say that I want to instantiate this template with values from 0 to 99, and call the function:

C++

#include <iostream>
template<int index> void do_stuff()
{
    std::cout << index << std::endl;
}

This is the type of problem I'm looking to solve here, as it's not one that we can handle with a normal run-time loop. The initial naive approach to solving this problem will involve recursion. Maybe this is good?

C++

template<int max_index, int index = 0> void stuff_helper()
{
    if (index <= max_index)
    {
        do_stuff<index>();
        stuff_helper<max_index, index + 1>();
    }
}
int main()
{
    stuff_helper<100>();
    return 0;
}

Depending on how familiar you are with working with templates in C++, you probably either think this looks good, or you're already familiar with why this won't work. Or maybe you think this looks good and know why it won't work.

On the surface, it could look like the if statement would be responsible for terminating the recursion, like how this would work with a "normal" run-time based recursion algorithm. But that's the problem. What works at runtime doesn't work at compile time.

This is an infinite loop, and only stops because compilers limit themselves to a certain recursion depth. In clang, I get an error fatal error: recursive template instantiation exceeded maximum depth of 256. You can expect a similar error with your compiler of choice.

Fortunately, there's a reasonable way to fix this.

The recursion problem in detail, and the fix

So the problem with the first attempt is that templates don't always work the way you might intuit how they work. I've found it a bit frustrating to learn myself, and this seems to be a common difficulty. However, once the quirks are understood and accepted, they don't have to be a problem any longer.

Your trusty C++ compiler will attempt to instantiate the template regardless of "normal" logic that is run-time based.

Which means that when we do this:

C++

template<int max_index, int index = 0> void stuff_helper()
{
    if (index <= max_index)
    {
        do_stuff<index>();

The if logic doesn't kick in when we want it to. The if has nothing to do with compile-time decisions, and your lovely compiler is obligated to attempt to compile

C++

        stuff_helper<max_index, index + 1>();

... even when you're at or past your max_index!

So what we need is something a little different. What we need here is a way for logic to branch at compile time rather than at run-time.

How about using specialization?

C++

#include <type_traits>

constexpr int terminate_index{ 100 };

template<int index = 0> void stuff_helper();
template<> void stuff_helper<terminate_index>()
{
}    
template<int index> void stuff_helper<index>()
{
    do_stuff<index>();
    stuff_helper<index + 1>();
}

int main()
{
    stuff_helper();
    return 0;
}

Nope, that doesn't work. C++ doesn't allow partial specialization of functions. Thankfully, there's a workaround to simulate partial specialization in this case: we can use tags. Tags in this context are empty types that will separate one function signature from another. So let's use std::integral_constant, and try this:

C++

#include <type_traits>

constexpr int terminate_index{ 100 };

void stuff_helper(std::integral_constant<int, terminate_index>)
{
}
    
template<int index = 0>
void stuff_helper(std::integral_constant<int, index> = std::integral_constant<int, 0>())
{
    do_stuff<index>();
    stuff_helper(std::integral_constant<int, index + 1>());
}

int main()
{
    stuff_helper();
    return 0;
}

Success! But how does it work?

This relies on the compiler's method for matching a function call to the proper function and makes use of template parameter deduction and tag dispatching.

The first time we call stuff_helper, only the second (templated) function matches the call, so that one is chosen.

Recursive calls up until terminate_index are also pretty straight-forward. We pass the std::integral_constant, and the template parameter is deduced by the compiler to match the int inside of the integral_constant.

std::integral_constant works here because every index creates a new type, and a new function signature. In essence std::integral_constant<int, 0> and std::integral_constant<int, 1> are completely different types even though they share an underlying template.

The last call is the special one though, as that's where there are two valid function signatures to choose from. When the inevitable call to stuff_helper(std::integral_constant<int, max_index>()); is made, the compiler must choose the proper one. But which one is that?

The compiler will choose a function without a template before one that does have a template. In effect, since the compiler can find a function signature that works without having to instantiate a template, it chooses that one without looking at the templated one. This process is known as overload resolution under the heading "best viable function".

Oh, and don't worry about the runtime performance cost of passing the unused std::integral_constant around. Unnamed types are optimized out in release mode on all compilers I'm aware of.

But, this code has its limits. What happens when we change our code to loop a few more times?

C++

constexpr int max_index{ 1000 };

Kablooie! We hit our template recursion error again. Or, at least I do on my compiler which has a maximum recursion depth of 256.

There's an easy way, and a fun way to work around this.

The easy way is to tell your compiler to increase its maximum template depth past 256 to something more appropriate to your specific situation.

The fun way is to work within the limit of a maximum template depth of 256, and allow ourselves to loop an arbitrarily large number of times. It's like solving a tricky puzzle. Maybe you don't like puzzles though. *shrug* As an added bonus, the fun way provides a generalized solution that is useful in other places.

Also to note: I'm writing library code where I don't have control over the user's compiler settings. As such, it's very important for me to be able to solve this situation in a way that will "just work" on a modern C++ compiler without modifying compiler flags.

The good stuff - Writing a generalized compile-time `static_for` method

Here's the basic interface I've created:

C++

template<size_t count, typename functor_template, typename... functor_types>
inline void static_for(functor_types&&... functor_args);

template<size_t start, size_t end, typename functor_template, typename... functor_types>
inline void static_for(functor_types&&... functor_args);

The static_for method takes either a count or a start and end point, a typename that contains a templated function called func that will be instantiated by static_for, and a list of function parameters to pass on with C++11's perfect forwarding (which in turn are automatically deduced!). Oh, and it also takes an optional argument called sequence_width which is used to internally. More on that one below!

Though the declaration looks complicated, using it is simple. As an example:

C++

struct do_stuff
{
    template<int index> static inline void func()
    {
        std::cout << index << std::endl;
    }
};

int main()
{
    static_for<5000, do_stuff>();
    return 0;
}

Looks good right? So now we have our interface. Now it's just a matter of writing the code to implement it...

Unfortunately, we need to wrap func() in a helper struct, as there is no way to pass the name of a templated function directly. This is another short coming of the C++ language in regards to what we can do at compile time.

The code

My solution to loop at compile time without triggering template depth errors is to break up the problem into smaller pieces.

Basically, the initial naive approach with recursion goes in a straight line. No matter whether you want to loop 20 times, or 20,000, the basic approach just keeps going until it's done.

What I've done differently is to break that up into smaller pieces. The template parameter sequence_width can be used to set how big the smaller pieces will be. This structure is known as an n-ary tree, where n is equal to sequence_width. I originally wrote this with a binary tree, and I moved to an n-ary tree as a compile time optimization.

Although the code below is suitable for large loops, you will almost certainly run into other limits in your compiler, or a hard limit based on how much time will be required to compile your program.

In other words: this isn't a magic bullet, and although this will allow you larger compile time loops than without this, there are still other things you will be contending with. Also: compiling templates are notoriously slow.

I'm going to share with you the header file right here, as well as some comments about how this thing works. The full working code is provided below. A test function and project files are included with the download.

The public interface:

C++

template<size_t count, typename functor, size_t sequence_width = 70,
    typename... functor_types>
inline void static_for(functor_types&&... functor_args)
{
    static_for_impl<0, count-1, functor, sequence_width, functor_types...>::
        loop(std::forward<functor_types>(functor_args)...);
}

template<size_t start, size_t end, typename functor, size_t sequence_width = 70,
    typename... functor_types>
inline void static_for(functor_types&&... functor_args)
{
    static_for_impl<start, end, functor, sequence_width, functor_types...>::
        loop(std::forward<functor_types>(functor_args)...);
}

The internals with comments:

C++

template<size_t for_start, size_t for_end, typename functor, size_t sequence_width,
    typename... functor_types>
struct static_for_impl
{
    static inline void loop(functor_types&&... functor_args)
    {
        // The main sequence point is created, and then we call "next" on each point inside
        using sequence = point<for_start, for_end>;
        next<sequence>
            (std::integral_constant<bool, sequence::is_end_point_>(), 
             std::forward<functor_types>(functor_args)...);
    }

private:
    
    // A point is a node of an n-ary tree
    template<size_t pt_start, size_t pt_end> struct point
    {
        static constexpr size_t start_        { pt_start };
        static constexpr size_t end_          { pt_end };
        static constexpr size_t count_        { end_ - start_ + 1 };
        static constexpr bool is_end_point_   { count_ <= sequence_width };

        static constexpr size_t sequence_count()
        {
            return
                    points_in_sequence(sequence_width) > sequence_width
                ?
                    sequence_width
                :
                    points_in_sequence(sequence_width);
        }

    private:
        // Calculates the start and end indexes for a child node
        static constexpr size_t child_start(size_t index)
        {
            return
                    index == 0
                ?
                    pt_start
                :
                    child_end(index - 1) + 1;
        }
        static constexpr size_t child_end(size_t index)
        {
            return
                    index == sequence_count() - 1
                ?
                    pt_end
                :
                    pt_start + points_in_sequence(sequence_count()) * (index + 1) -
                        (index < count_
                    ?
                         1
                    :
                         0);
        }
        static constexpr size_t points_in_sequence(size_t max)
        {
            return count_ / max + (
                    (count_ % max) > 0
                ?
                    1
                :
                    0);
        }
           
    public:
        // Generates child nodes when needed
        template<size_t index> using child_point = point<child_start(index), child_end(index)>;
    };

    // flat_for is used to instantiate a section of our our main static_for::loop
    // A point is used to specify which numbers this instance of flat_for will use
    template<size_t flat_start, size_t flat_end, class flat_functor> struct flat_for
    {
        // This is the entry point for flat_for
        static inline void flat_loop(functor_types&&... functor_args)
        {
            flat_next(std::integral_constant<size_t, flat_start>(), 
                std::forward<functor_types>(functor_args)...);
        }

    private:
        // Loop termination
        static inline void flat_next
            (std::integral_constant<size_t, flat_end + 1>, functor_types&&...)
        {
        }
       
        // Loop function that calls the function passed to it, as well as recurses
        template<size_t index>
        static inline void flat_next
            (std::integral_constant<size_t, index>, functor_types&&... functor_args)
        {
            flat_functor::template func<index>(std::forward<functor_types>(functor_args)...);
            flat_next(std::integral_constant<size_t, index + 1>(),
                std::forward<functor_types>(functor_args)...);
        }
    };

    // This is what gets called when we run flat_for on a point
    // It will recurse to more finer grained point until the points are no bigger than sequence_width
    template<typename sequence> struct flat_sequence
    {
        template<size_t index> static inline void func(functor_types&&... functor_args)
        {
            using pt = typename sequence::template child_point<index>;
            next<pt>
                (std::integral_constant<bool, pt::is_end_point_>(),
                 std::forward<functor_types>(functor_args)...);
        }
    };

    // The true_type function is called when our sequence is small enough to run out
    // and call the main functor that was provided to us
    template<typename sequence> static inline void next
        (std::true_type, functor_types&&... functor_args)
    {
        flat_for<sequence::start_, sequence::end_, functor>::
            flat_loop(std::forward<functor_types>(functor_args)...);
    }

    // The false_type function is called when our sequence is still too big, and we need to
    // run an internal flat_for loop on the child sequence_points 
    template<typename sequence> static inline void next
        (std::false_type, functor_types&&... functor_args)
    {
        flat_for<0, sequence::sequence_count() - 1, flat_sequence<sequence>>::
            flat_loop(std::forward<functor_types>(functor_args)...);
    }
};

The current version of Visual Studio doesn't include full support for constexpr. The source code download has a version of static_for that doesn't require it.

A test program

C++

// This is a sample struct that is instantiated at compile time with static_for
struct do_stuff
{
    template<size_t index> static inline void func(bool* b)
    {
        assert(!b[index]);
        b[index] = true;
        std::cout << index << std::endl;
    }
};

int main()
{
#define for_count 2000

    // To show that this works, I create an array of bools

    // The bools are all initialized to false

    // On each loop iteration, the bool at the index is tested to make sure that do_stuff hasn't been
    // called at that index. After the loop, each bool is tested to ensure that it was set to true.

    // In effect, this test validates that each index is called once and only once.

    bool b[for_count + 100];
    memset(b, false, for_count);

    // ... and that we don't overshoot the end point
    memset(&b[for_count], true, 100);

    static_for<for_count, do_stuff>(b);

    for (size_t i{ 0 }; i < for_count; ++i)
        assert(b[i]);

    // When the program completes without error, our test has passed successfully

    return 0;
}

Code commentary

The static_for_impl is initialized with the information you pass it, and then it creates a point. Consider a point to be like a node of a tree.

If the point contains sequence_width items or less, then the point is marked as one that doesn't require splitting (the value is_end_point_ will equal true for that point).

If the point contains more than sequence_width items, then its start_ and end_ values will be split into child_points.

The child_points are checked to determine if they need further splitting.

If so, then each child gets a number of point items (types) equal to the parent's count_ divided by sequence_width.

Else, if no further splitting is required, then the number of children points are minimized to the parent's count_ divided by sequence_width.

Each point is fed through the internal flat_for template. In there, the point is used to either call more instances of flat_for with child points, or it is used to instantiate the functor that was provided with the initial static_for call at the proper index.

Overall, the code works by breaking apart the task of the for loop into points which are sequence_width or smaller. The points are internally structured as an n-ary tree. It then feeds each point through flat_for, which is used to call your code at each index that you requested when you call static_for.

How can I choose between a compile-time and run-time loop?

There are some cases where you don't have a choice.

I wrote this because I had a need to loop with other compile-time code, and in that case I could not use a run-time loop.

There are often times where there is logic related to the loop that cannot be known until runtime. An example would be a loop that requires external data to know when it terminates.

Much of the time, the question comes down to an engineering problem, as it requires weighing the pros and cons of both choices.

Compile-time/unrolled loops will take longer to compiler. They might allow for slightly higher runtime performance. This could be a benefit that is worthy of the slower compile time in some cases.

In many cases, the performance difference of saving a handful of nanoseconds will be irrelevant. And yet, in others saving a handful of nanoseconds will be of vital importance.

At the very least, it is good to be aware of the tradeoffs, and to know why you are picking one choice over another.

Questions and comments

I would appreciate any questions and comments you have about this. Let me know what you like, what you don't like, and how I can make this better.

Let me know what kind of template meta-programming topics you're interested in, and what you'd like to see me take on with a follow-up article.

History

v 1.0 - December 28th, 2014 - Initial release

v 1.1 - December 29th, 2014 - Additional info and cleanup

v 1.2 - December 30th, 2014 - Code cleanup, additional information, added test program and more "related topics"

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

Michael Gazonda

Architect Zed Mind

Canada

I love writing good code, and am currently spending my time working on a project I love.

I'm a yoga teacher, and enjoy the outdoors which I never seem to get enough of. Hopefully soon...

If you'd like, you can connect with me on Facebook.

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.