Download source - 38.4 KB

Introduction - The Inspiration
Ideas Covered in this Article
So, why would I (the reader) want to use this?
Why would I want to use such an archaic container? ... and what are the benefits of lock-free coding?
Getting Started: Seeing the stack in action!
An example - an intense multithreaded push/pop operation
A simple container for your data
First Up: the stack::push
So, what's happening here? (stack::push)
The stack::push Loop
Afterward: Ideas on where to go from here:

Introduction - The Inspiration

I was sitting at my computer, looking for inspiration about some new code to write. And I found myself wondering: what did I really want to create?

Then, it came to me. I wanted to write a cross-platform foundational framework that would be suitable for use by anyone - from someone casually writing a command line app to a large corporation writing a triple-A video game. No big deal. I know what I'm doing. Shouldn't take long, right? (cue audience laughter)

The idea came from my past. What were the most interesting things I've worked on? What were the most difficult? What projects did I truly enjoy creating? Memory management. Multi-threading. Coding across independent platforms. Those were all a bit frustrating at times. And, for the most part, I'm guessing those things are frustrating for everyone.

I considered: what would the project look like?

I know! It must be fast. Really fast. And multi-threaded. And I need a custom memory allocator too. 'Cause man, sometimes calling new and delete is slow. Getting memory must be fast!

So then I got started.

I quickly remembered all of the problems I had when writing code like this in the past. I'd start writing a thread pool, and I'd think about how I'm going to pass objects back and forth. I remembered times in the past that I would profile code, only to see how memory allocation was the bottleneck a lot.

That can't happen here. Sure, that's no big deal for many projects, but it's not acceptable for this one. If I'm going to allocate 20 or 20,000 bytes of memory to send something to another thread, it must be instantaneous. By that, I mean a millisecond is too slow, and I'm looking on the level of about a dozen or so processor cycles. I dream big!

The more I thought about it, the more I realized I wanted to write the fastest code there is. And to do that, I was going to require a lock-free container. It is simply the best way to improve throughput, and reduce latency. (cue dramatic music)

Ideas Covered in this Article

Multithreading - The idea of your computer doing more than one thing at a time.

Data Synchronization - Discussion about how to main data integrity in a multithreaded environment.

Lock-based synchronization - A widely used method of data synchronization. Common examples are critical sections and mutexes.

Lock-free synchronization - An alternate method of synchronizing data where threads don't block each other from continuing to do their thing. There is still technically a lock involved, however the lock is acquired and released by the processor all in one instruction.

ABA Problem - A typical shortcoming of the widely available compare_exchange instructions is the inability to verify that the data we think we're interacting with hasn't changed. The ABA problem is solved in the lock-free stack in this article by using identification numbers.

compare_exchange in c++ - Documentation on both compare_exchange_weak and compare_exchange_strong. Here is an article talking about the difference between the two.

So, why would I (the reader) want to use this?

Ok, great, I've told you why I needed to write this, but what purpose could you have for using such a thing?

This is something to consider using if you're writing any multithreaded code that wants to send some sort of data between threads. It's most useful in cases where you don't care about the order of the items in your container. There's no sorting here. Heck, you won't even get a way to enumerate all of your items. All you get is a push and a pop.

Ok, ok. If you're thinking that sounds ridiculous, and wondering why you would ever want to use such an archaic container, then...

Why would I want to use such an archaic container?

... and what are the benefits of lock-free coding?

Took the words right out of my mouth. Yeah, it's a good question. Here's the simple answer: when writing multi-threaded code, you pay for everything. That vector you're so comfortable with? Yeah, it's pretty expensive to synchronize that thing. Sure, you can do it. But that's where I come in, and offer you a choice you may not have realized you had before.

See, this ol' archaic stack has something big going for it in the land of multithreading. It. Is. Simple.

And what I mean by simple has to do with what's going on under the hood. Just because it's a stack doesn't mean much. I could write a thread-safe stack that would have abysmal performance. You don't want to read about that though, now do you? Thought not :D

So where were we? Oh yeah. Simplicity. This stack over here doesn't use any locking mechanism to secure your data - at least not in the traditional sense. The new c++11 standard has come with some cool new toys, as in, we now get to use atomic compare_exchange instructions in c++ without breaking into platform specific assembler!

These compare_exchange doohickeys are amazing. I'll go into more detail soon, but they allow you to lock down a memory location, check if the value there is what you expect it to be, and then if it is, that memory location is updated with a new value. And it returns a true/false on whether or not everything went through. You get all of that in one atomic instruction.

Getting to do all of that in one instruction is what drives lock-free programming. It means there's a new way to synchronize data in town, and you can throw out those dirty kernel mutexes. Ok, well, not quite. But we're getting there. (side note: I think kernel mutexes have a place for the forseeable future, but their place will be less common over time)

So in the end, I can't tell you if using this stack will be right for you in the project you're working on. But if it is, there is an opportunity for higher throughput and lower latency than traditional locking mechanisms. Atomic spin-locks based on compare_exchange functions generally benchmark to be roughly equivalent in performance to lock-free stacks. They might be the topic for another article.

Here are some big advantages of lock-free coding over traditional methods:

There will never be a deadlock as a result of lock-free synchronization. It is simply impossible, as there is no lock to become deadlocked with.
Threads will always maintain forward motion. Using even the fastest locks for very short time periods can result in high latency blocks across multiple threads. An example of this is when the operating system preemptively yanks away your program from doing it's thing to put someone else at the cpu to get their turn. With lock free coding, when a scenario like that happens, it only delays the thread that is active, whereas with a lock it delays the active thread and all threads that also want the lock.

There's a lot of potential with this lock-free stuff, and quite frankly I think it's uber-cool. It's the next big thing. And what I find most fun about it is that it requires a different way of thinking and reasoning than before.

Getting Started: Seeing the stack in action!

An example - an intense multithreaded push/pop operation

This is the exciting part! Ok, well, it's exciting for me at least...

Here's an example of what you can't do with a normal container: safely add and remove items across multiple threads at the same time without a lock!

// These are the testing parameters, you can change them
#define data_count 10
#define loop_count 1000000
#define thread_count 4

// This is the test function
// It uses those numbers set at the top

// Using new and delete is part of the test - for our new/delete calls to not crash,
// the data inside the stack must be handled properly. When we don't crash, and there is no
// "lost data", we know that everything went properly when playing with the stack.

void thread_test(stack *s, std::atomic *max_elapsed, std::atomic *empty_count, size_t index)
{
    // Initialization - create the data we'll test with
    data* d[data_count];
    for (size_t i = 0; i < data_count; ++i)
    d[i] = new data;

    std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();       
    // This is the test loop
    // Push and pop x number of times and see if everything comes out ok on the other side.
    // Also a good working sample of moving data into and out of the stack.
    for (size_t loop = 0; loop < loop_count; ++loop)
    {
        for (size_t i = 0; i < data_count; ++i)
        {
            if (d[i])
                s->push(d[i]);
        }

        for (size_t i = 0; i < data_count; ++i)
            s->pop((node*&)d[i]);
    }

    std::chrono::high_resolution_clock::time_point finish = std::chrono::high_resolution_clock::now();
    std::chrono::milliseconds span = std::chrono::duration_cast(finish - start);

    std::cout << index << " - thread completed : " << span.count() << "\r\n";
    *max_elapsed = span.count();

    // If the test is successful, every location will hold a valid pointer, and no pointers
    // will be duplicated. We test for a valid pointer, and by using delete we ensure that we
    // didn't have the same data in two places - deleting the same data twice would crash.
    //	    - may not crash on all platforms, so this test isn't meant to have total cross-platform support
    for (size_t i = 0; i < data_count; ++i)
    {
        if (d[i])
            delete d[i];
        else
            (*empty_count)++;
    }
}

// This is the place where we get stuff done. Start up the test threads, wait for them, and then
// check the results to display to the user afterwards. No big deal :)
int main(int argc, const char * argv[])
{
    std::thread threads[thread_count];
    std::atomic max_elapsed{ 0 };
    std::atomic empty_count{ 0 };
    stack s;

    std::cout << R"_(///////////////////////////////////////////////////////////////////////////////
///                         SIMPLE LOCK FREE STACK                          ///
///           Copyright (c) 2014 Michael Gazonda - http://mgaz.ca           ///
///                              CPOL Licensed                              ///
///       See CPOL.htm, or http://www.codeproject.com/info/cpol10.aspx      ///
///               http://www.codeproject.com/Articles/801537/               ///
///                 https://github.com/MGaz/lock_free_stack                 ///
///////////////////////////////////////////////////////////////////////////////)_";
    std::cout << "\r\nstarting\r\n";

    // Start threads
    for (size_t i = 0; i < thread_count; ++i)
        threads[i] = std::thread(thread_test, &s, &max_elapsed, &empty_count, i);

    // Wait for them to all finish
    for (size_t i = 0; i < thread_count; ++i)
        threads[i].join();

    // Output information
    size_t operation_count = data_count * loop_count * thread_count * 2;
    if (!empty_count)
        std::cout << "no lost data\r\n";
    else
        std::cout << "___lost data count___: " << empty_count << " of ";
    std::cout << data_count * thread_count << "\r\n";
    std::cout << "thread count         : " << thread_count << "\r\n";
    std::cout << "target processor bits: " << PROCESSOR_BITS << "\r\n";
    std::cout << "total pushes and pops: " << operation_count << "\r\n";
    std::cout << "operations per second: ";
    std::cout << operation_count / (max_elapsed.load() > 0 ? max_elapsed.load() : 1) * 1000 << "\r\n";
    std::cout << "processing time      : " << max_elapsed.load() << "ms\r\n";
    std::cout << "press any key to exit\r\n";
    std::getchar();
    return 0;
}

So the cool part of this to keep in mind is that you can push and pop to your heart's content from any thread without worrying about the integrity of your data container. Please keep in mind that you will still have to be thread-safe with the data itself.

The Internals: How the stack works

Let's Take a Look at `stack.h`

#include <atomic>
#include "node.h"
class stack
{
public:
    void push(node* n);
    bool pop(node*& n);
protected:
    std::atomic<node> head_;
};

Not much there. We get a push, a pop, and there's a head_ element. It doesn't look like much, however the lack of all of the fun toys that we get in more advanced containers is what allows this one to be extremely efficient.

Since you'll want to attach your own data to the nodes, let's take a quick look at how to do that:

A simple container for your data: `data.h`

class data : public node
{
    // Put your data in here
    // You can make your own data structure to throw in the stack, but it _must_ derive from node
	
    // Just an example, serves no purpose other than that - you may remove
    int x_;
    char buffer_[10];
};

So really, all it's going to take is for you to create a class or struct that derives from node. Does it have to derive from node like that? No, there are other ways to do it, but they are out of the scope of this article. This is the way to get you up and running now. Ok, let's take a look at how the stack works:

First Up: the `stack::push`

void stack::push(node* n)
{
    node old_head, new_head{ n };
    n->n_ = nullptr;
    while (!head_.compare_exchange_weak(old_head, new_head))
    {
        n->n_ = old_head.n_;
        new_head.create_id(old_head);
    }
}

So, what's happening here?

The first thing that I'd like to draw your attention to is that there is only one atomic operation (though it is in a loop). The alternative is to start with a head_.load(), but I chose not to. Primarily, this is because I prefer less atomic operations when possible for readability and reasoning purposes. I follow-up with I don't think there is any performance benefit to be gained by pre-loading the head value before running compare_exchange, though my tests aren't extensive at this point. YMMV.

node old_head, new_head{ n };
n->n_ = nullptr;

This is the initialization. We set the new node's (n) next pointer (n->n_) to nullptr. We have no idea what is in the head element at this time, but we're going ahead as if the stack is completely empty. To handle the empty stack scenario, we set the new head to point to our new node, and we set our new node's next pointer to nullptr.

while (!head_.compare_exchange_weak(old_head, new_head))

This is gold! That is the atomic push (and pop) operation! Here's what this statement is saying:

LOOP
    LOCK head_
        IF head_ == old_head
            head_ = new_head
            success = true
        ELSE
            old_head = head_
            success = false
        END IF
    RELEASE LOCK head_
    
    IF success
        BREAK LOOP
    END IF
	
    {
        DO __loop_code__
    }

REPEAT LOOP

So that's what you're getting with an atomic compare_exchange loop. It's a lot, and I thought it would be helpful to break it up into a more traditional way of reading what the code is doing.

The most amazing thing isn't that we're getting all of this cool stuff, but that it's all happening as one uninterruptible processor instruction. Much of the time, you won't care if all of those things happen at the same time or not, but this is different. The reason for going to all of the trouble of being able to do so much all in one operation is so that you can perform these operations in a multi-threaded application without concern for the problems that can occur otherwise.

To consider this in another way: when you work with multiple threads, you have to consider that "anything" can happen in between one instruction to the next. The way to avoid the problem of something we don't want happening in between all of the little parts of our code is to use a lock, or to use an instruction (compare_exchange) that does a bunch of useful stuff that can't be broken up.

The `stack::push` Loop

// loop
{
    n->n_ = old_head.n_;
    new_head.create_id(old_head);
}

This code is run every time we try our compare_exchange operation, and it fails. There are some very important things that happen here to get ready for the next compare_exchange attempt.

First, we set our node's next pointer (n->n_) to point to the current head of the stack's next pointer (old_head.n_).
Next, we set the identification number of the new_head item that will soon become the head of the stack to 1 plus what the current head's id number is. (this is required to solve the "ABA" problem described in the further reading section at the top)

Doing all of that ensures that when the compare_exchange succeeds, the stack is in a valid state afterwards.

And Now, For the `stack::pop`!

bool stack::pop(node*& n)
{
    node old_head, new_head;
    n = nullptr;
    while (!head_.compare_exchange_weak(old_head, new_head))
    {
        n = old_head.next_pointer();
        if (!n)
            break;
        new_head = node { n->n_, old_head };
    }
    return n != nullptr;
}

A lot of this is very similar to the push operation. It's just a little bit different.

node old_head, new_head;
n = nullptr;

In the push, we initialized next to "n", and now we initialize the value to nullptr. Similarly, when pushing we initialized n->n_ to nullptr, and now we initialize n to nullptr. Same idea as to how things work, we're just going in the opposite direction as we were before.

while (!head_.compare_exchange_weak(old_head, new_head))

The most important line in both the push and the pop are exactly the same! This is cool, and also helps with understanding the code. On both a push and a pop, we need to have the "new_head" ready every time we run the compare_exchange. The setup is a little different, but that's all it is.

{
    n = old_head.next_pointer();
    if (!n)
        break;
    new_head = node{ n->n_, old_head };
}

This is our pop loop. It's got a little bit more going on than the push loop, but the idea is the same.

When our compare_exchange succeeds, we want the node* that we return to point to the node that was popped off of the stack. That's what the n = old_head.next_pointer(); is doing.

Next, we want to test the pointer to see if there's anything there. If there's not anything there, as in (!n) evaluates true, then we exit the loop and return false. If that happens, it means we tried to pop the stack, but there was nothing there.

Last up, we create the new_head. It's going to point to the second item on the stack (or nullptr if there isn't one). We also need to give the new_head an id number, and so we pass the old_head through as well. The new_head's id will be one greater than the old_head's id.

The magic in node.h:

struct node
{
    node *n_;
#if PROCESSOR_BITS == 64
    using stack_id_t = uint16_t;

    inline node() : n_{ nullptr }                    { }
    inline node(node* n) : n_{ n }                   { }
    inline void create_id(const node& nid)
    {
        ((stack_id_t*)this)[3] = ((const stack_id_t*)&nid)[3] + 1;
    }
    inline node* next_pointer()
    {
        return (node*)((uint64_t)n_ & 0x0000ffffffffffff);
    }
#elif PROCESSOR_BITS == 32
    using stack_id_t = uint32_t;
    stack_id_t t_;

    inline node() : n_{ nullptr }, t_{ 0 }           { }
    inline node(node* n) : n_{ n }, t_{ 0 }          { }
    inline void create_id(const node& nid)           { t_ = nid.t_ + 1; }
    inline node* next_pointer()                      { return n_; }
#else
    static_assert(false, "unknown pointer size");
#endif
    inline node(node* n, const node& nid) : n_{ n }  
    { 
    	if (n_) 
            create_id(nid); 
    	else 
    	    create_id(*this); 
    }
};

It's a lot for a simple node, huh? It may look a little funky, so let me explain.

Remember those id numbers that came up in the stack? Well, here's how that works. On a 32 bit system, we simply attach another 32bit number to our node, and we then run our atomic compare_exchange operations on the 64 bits of node and it's identification number.

It's not quite that simple for 64bit computing, at least for now. When attaching a second 64 bit number to the node, the current C++ compilers I tried (clang, and Visual Studio 2013) didn't allow for 128 bit lock-free compare_exchanges. I could write the code, and run it, but under the surface, the compiler was turning my lock-free code into code that used locks.

Luckily, on x64 systems, only 48 bits of pointers are used for memory addressing. The other 16 bits are unused. That gives us the space we need to store id numbers with the pointers when we run the compare_exchange operations. The code above is an implementation of this.

And, that's it! A lock-free stack.

Afterward: Ideas on where to go from here:

Find information in your code that requires synchronization, and see if this solution would be appropriate
Make a thread pool that doesn't use locking for a message container.
Advanced - create a lock-free memory allocator
Look into common library calls, and whether or not they can be blocking. Are new and delete blocking? How can you write lock-free code if they are?

History

Version 1 - July 26, 2014
Version 2 - July 30, 2014 - Added more background information, and how/why someone might want tochoose this as a solution for their projects. Improved formatting. Added working project files for compiling on Linux (Code::Blocks). Changed license to CPOL.
Version 3 - Aug 7, 2014 - Added Table of Contents, minor formatting improvements + fix of code typo.