Skip to main content
Email Password   helpLost your password?

Contents

Part I

Part II

12th November, 2009: Re-wrote the article to explain even better the nitty-gritty details. Thanks everyone for the feedback.

Introduction

I attended Herb Sutter's Effective Concurrency Seminar (Europe 2009), a very inspiring seminar where he approached multi-core targeted programming. One thing we worked with was the dangers of Lock-Free programming and hazards of volatile in C++.

I got curious since I've seen a couple of simple single producer -> single consumer Circular Queue implementations that are implemented as Lock-Free and using just that, the important state variables defined as volatile. So how come they work (being used for years), or do they?

Volatile as you might know is not intended for multi-threading but for accessing hardware devices (and more). It turns out that volatile plays a minor role and that it's mostly the compiler and computer platform that decides whether or not this kind of Circular Queues will be safe to use.

I have implemented a thread safe circular queue in C++ and explain in this article when it works and when it will not work. This is an area outside the scope of the C++ standard (until C++0x) and is suitable for (at least) x86 or x64 platforms.

For those of you who want to read up on the inner workings of the Circular Queue, I have provided a short description of it and my implementation. For those of you who are already familiar with it and just want to know if this thread safe circular queue is for real, just go straight to the Implementation section.

Background

If an application does multiple tasks simultaneously, then multi-threading would be useful. In this particular example, I show a circular FIFO queue that can be used for communicating between maximum two threads. This scenario is often referred to as the single producer -> single consumer problem.

Example: High priority task (GUI, data recorder, etc.) that must not be delayed in lengthy operations and require them to finish in the same order as they are started.

The solution: Delegate the time consuming jobs to a worker thread. Using a FIFO queue makes it possible for the delegation and processing of tasks to be done asynchronously i.e. the producer and consumer each work in their own thread and therefore reads (remove) can be concurrent with writes (add). The FIFO qualities guarantee that delegated tasks are processed in the same order as they were added.

The producer can add new tasks to the queue as long as the queue is not full, and the consumer can remove tasks from the queue as long as the queue is not empty.

It's worth mentioning that Circular Queues are commonly used when:

But all of these examples are outside the scope of this article.

Text and Code Disclaimer

I decided to keep this example simple and not go into all the nitty-gritty details that an optimal solution might have. This is why any writes to a full queue or any reads from an empty queue are just plainly ignored. For the same reason, the queue is not optimized to avoid false sharing.

It can be argued that under some conditions like in my "test_of_fifo_queue" the queue will suffer from false sharing if both threads are on different cores. However under other conditions than in my silly test, this is not necessarily the case. What decides if there will be false sharing depends on what is stored within the array: integers, pointers or whole objects and if they are on the same cache line as both tail and head indexes are accessing.

IF there is false sharing, it's not always a very bad thing. It all depends on the frequency of the false sharing, i.e., how often the consumer/producer threads are trying to access the same cache line.

Implementation Overview

The simple circular queue is ideal when explaining the issues involved with lock-free programming. It is limited in scope but still very useful in real world practices.

I have implemented a simple circular queue (a.k.a. circular/ring buffer) to try to explain some of the issues involved in lock-free programming. I will try to describe why and when it works and why and when it will not work.

The approach of circular buffers I've seen previously have used a simple c-type array to store the data as raw bytes (using memcpy) with no regard to class type (if storing objects). At unpacking, the correct bytes are copied and reinterpreted to the correct type. I decided to use a different approach. In my example, I use a template style approach where the size of the Circular Queue and the type to store within are given as template arguments.

template<typename Element, unsigned int Size>
class CircularFifo {
public:
   enum {Capacity = Size+1};

   CircularFifo() : tail(0), head(0){}
   virtual ~CircularFifo() {}

   bool push(Element& item_);
   bool pop(Element& item_);

   bool isEmpty() const;
   bool isFull() const;

private:
   volatile unsigned int tail; // input index
   Element array[Capacity];
   volatile unsigned int head; // output index

   unsigned int increment(unsigned int idx_) const;
}; 

Notice that the tail and head are on opposite sides of the array. Not for any guarantees but at least it's a sporting chance that the tail and head indexes can be on different cache lines. It all depends on the array size. Buffering can always be applied to be sure.

The properties that decide the state of the queue are the array with Elements, the Size of the array and the indexes for current tail and head of the queue.

The Capacity of the array is one larger than the specified Size. This ensures a slack of one which makes it easy to distinguish between full buffer and empty buffer.

// Remember these?
enum {Capacity = Size+1};         // 1 extra for slack
volatile Element array[Capacity]; // gives easier full and empty determination

Circular FIFO - How It Works

Below I have described graphically some writes (push) by the producer thread and some reads (pops) by the consumer thread.

When the buffer is empty, both indexes will be the same. Any reads by the consumer will fail.

Empty Queue

Example of an empty queue
/// Consumer check for a an empty queue
if(head == tail)
    return false; // queue is empty

The Producer adds (push) a new item at the position indexed by the tail. After writing, the tail is incremented one step (or wrapped to the beginning if at end of the queue).

Queue with one item

Queue with one item

The Producer adds another item and the tail is incremented again.

Queue with two items

Queue with two items

The Consumer reads (pop) the item indexed by the head. The head is incremented one step.

Queue after pop of item1

Queue after pop of item1

If the producer pushes more items onto the queue than the consumer can keep up with, the queue will eventually become full.

Full Queue

Full Queue

When the buffer is full, there will be a one slot difference between head and tail. At this point any writes by the consumer will fail or else the write index would be the same as the read index and two bad things would happen. First item2 in the picture would be overwritten and second, the empty queue criterion would be satisfied even though the queue is full.

/// Verify that queue is not full
if(((tail+1) % Capacity) != head)
{
   ... // do stuff, we're not full
}

Implementation

The thread safety is ensured through the class design. Only two functions can modify the state of the object, push() and pop(). They are each used by a single thread only. The producer thread should only access push() and the consumer thread only pop().

The design can be further emphasized with access through interfaces, but I think it's enough to have clarified this in the comments.

Using more producer or consumer threads breaks this design and the queue can become corrupt. If more threads are wanted, then a different solution should be used, i.e. queue with write/read locks.

push() changes only the tail, but verifies that queue is not full (check of head) and pop() changes only the head but verifies that the queue is not empty (check of tail). Only the producer thread will update tail and only the consumer thread will update head. Thus, thread safety for updating the head and tail is ensured by the design.

However reading and writing of both tail and head must be thread safe in the sense that we do not want to read old, cached values and writes must be atomic.

Important: See the text further down regarding this. It states how atomic, non-reordered reads and writes are achieved for this Circular Queue. It is probably the most important part of this article.

/** Producer only: updates tail index after writing*/
template<typename Element, unsigned int Size>
bool CircularFifo<Element, Size>::push(Element& item_)
{
   int nextTail = increment(tail);
   if(nextTail != head)
   {
      array[tail] = item_;
      tail = nextTail;
      return true;
   }

   // queue was full
   return false;
}

/** Consumer only: updates head index after reading */
template<typename Element, unsigned int Size>
bool CircularFifo<Element, Size>::pop(Element& item_)
{
   if(head == tail)
      return false;  // empty queue

   item_ = array[head];
   head = increment(head);
   return true;
}

Improvements and Code Notes

This code is minimalistic and made to show how some C++ lock-free circular queues are made. It is in no way optimized and should be considered as an example and not as a fully packaged and finished solution to fit all needs. In any case, this queue can be a good starting point if you need lock-free circular queue between two threads.

Making It Work

Remember that for a multiple-producer or multiple consumer scenario, this code will not work.

First, it is imperative that the head and tail indexes must be updated only after the element (pop() or push()) is read or written. Thus, any updates to the head or tail will ensure proper access to the elements. For this to work, the read and writes of the head and tail must be atomic and must not be re-ordered somehow.

Atomic

An atomic operation is an operation that is guaranteed to finish and not be interrupted half-way by another thread.

On almost all modern processors, reads and writes of naturally aligned native types are atomic. This means that as long as the memory bus is at least as wide as the type being read or written the CPU reads and writes will happen in a single bus transaction. This makes it impossible for other threads to see them half-completed.

For x86 and x64 platforms, types larger than 8 bytes are not guaranteed to be atomic. But in our example the head and tail indexes are integers of size 32bits (4 bytes) on 32-bit processor or 64bits (8bytes) on a 64-bit processor, thus making reading and writing them atomic.

 // Example of integers
 counter++;                // 1. not atomic operation, but three operations
 counter = 0;              // 2. this write is atomic
 other_variable = counter; // 3. this read is atomic

For fun, you can compile the operations above to assembly language or use the debugger to view the disassembly. You should then see that the actual store only takes one instruction.

On multiple core CPUs, this is also true. A word in memory must be coherent between all cores when it is written (using one CPU instruction). It is simply not allowed or even possible to split a 4 byte writing (I use a 32-bit example) between cores, since a word (4 bytes) is written as a single instruction.

In short. Reads and writes of tail and head are atomic. First is still OK.

Reordering

Second, the read and writes of the head and tail must be not be reordered. Reordering can happen by both compilers and processors. In fact reordering can happen both to instructions and read, write operations. For the Circular Queue to be safe, this must not happen to our volatile read and writes (or else First will fail).

Compiler Reordering

Compiler rearrangement is dealt with in this particular case by our use of volatile. The C++ standard defines that reads of volatile variables cannot be cached nor delayed at writing. It also defines that volatile reads or writes cannot be moved past each other.

To quote hax_:

"'volatile' variable access is also code rearrangement barrier: no lines can be moved up or down through it (you can treat every access to volatile variable as horizontal barrier in code editor)".

Normally this wouldn't be enough for multi-threading programming since the compiler can reorder non-volatile read/write relative to volatile read/write. But for this circular queue it's all read/write of volatile. Either the indexes themselves are read or written or the array is accessed with a volatile. Thus, this type of queue is safe from compiler reordering that could break First.

CPU Reordering

CPU rearrangement rules on x86 and x64 are guaranteed to not reorder instructions. They also do not reorder write operations relative to other writes, or read operations relative to other reads. The exception to these rules are for some special operations of writes and reads that doesn't apply here. Important to know is that writes cannot move ahead of reads, but reads can move ahead of writes.

Reorder Summary

CPU Reordering Summary for x86 and x64 Platforms

  1. Reads moving ahead of other reads: No
  2. Writes moving ahead of other writes: No
  3. Writes moving ahead of reads: No
  4. Reads moving ahead of writes: Yes

If writes could move ahead of reads, then First would be broken. But as it stands now, First is safe also from CPU reordering.

Making It Work: Conclusion

All of this long text boils down to the following: The Circular Queue's way of accessing tail, head and array is not allowed to be reordered. Thus we can be sure that each atomic update to the indexes is always done after updating the storage (push() or pop()).

Even when the consumer/producer threads are on different CPU cores, they will see the latest change and not some old cached values or messed up by reordering (provided the x86/x64 architecture restrictions).

What's Wrong with this Code?

From a practical, real-world look at the x86 and x64 computer architectures, this code will work. Using it on another platform, say a PowerPC could break the Circular Queue (PowerPC reorders things differently than the x86/x64).

A C++ purist is probably enraged by now by the use of volatile. :p

Disregarding that, it is also important to know that the C++ standard won't guarantee atomicity for integers (this is done nonetheless according to the constraints laid out earlier). With the coming of C++0x things will be much brighter and some confusion can be avoided when we can start using atomic<T>.

Remember also that multi-threaded C++ code is NOT platform independent. Even if it may appear so through whatever threading API you may be using (Qt/Boost/etc). The same goes for this example of a circular FIFO. It's not platform independent.

On a side note, I can mention that Microsoft has given extra responsibility to volatile on VC++ 2005 and newer. See here for details.

Finally ...

What I learned when working with multi-threaded software is that lock-free is anything but easy. If nothing else, there is so much conflicting information about what you can and cannot do that it could confuse anyone.

One view on this is what the C++ standard guarantees. A whole other view is the practical look on what we actually get from today's compilers and computer architectures in use and how they affect things. Lock-Free code with C++ is (until C++0x) a scope outside of the language standard and more of an issue with the compiler, CPU, chipset and memory controller.

One thing is to do a simple FIFO like in this example, a whole different issue is if you want to use something more complex than one reader, one writer. If you feel inclined to do something more advanced, like a lock-free linked list, maybe for multiple readers/writers (yes, simple as it sounds, it is not) then please don't...

I agree with Herb Sutter, this is a difficult topic that has lots of pitfalls. Event when C++Ox is available, more advanced lock-free structures should probably be avoided. If you are deciding to roll your own Lock-Free structure, use the tools available on your platform (like Microsoft's Interlocked**).

Last, if having any doubts. Use locks instead of lock-free. Sure it might introduce some overhead and it will be platform-dependent, but then again nothing is platform-independent when using C++ and threads.

References

  1. [C++ atomic int]
  2. Lockless Programming Considerations for Xbox 360 and Microsoft Windows
  3. Another example of C++ FIFO that I discovered when I was already finished.
    It had some nice discussions in the end about this very topic.
  4. [Herb Sutter] Volatile vs Volatile
  5. Wikipedia about Circular Buffer
  6. Another Dr. Dobbs Circular Queue example
  7. A visual explanation of circular queue
  8. Wikipedia: More about lock-free queues and atomicity
  9. Not related to C++, but maybe still interesting. A C# implementation

Other resources that I linked to, or that might be of interest:

  1. Lock-Free Code: A false sense of security
  2. Wikipedia - FIFO
  3. Wikipedia - Volatile
  4. Wikipedia - Producer, Consumer problem
  5. False Sharing
  6. "What every programmer should know about memory".
    A dry document, but especially chapter 6 (What Programmers can do) is worth reading.

History

You must Sign In to use this message board.
 
 
Per page   
 FirstPrevNext
GeneralA most interesting article...
Bill Gord
14:34 16 Nov '09  
I have been using such lock-less circular queues in mainframe system software since about 1985. While the details are obviously quite different (I was not using C++, an X86, nor Windows), I will absolutely state that the mechanism is viable and valuable. The systems involved had from 1 to 24 processors (and perhaps even more in the recent years), with similar (but slightly different, obviously) atomicity, synchronicity and caching issues. I recall that professor E.W.Dijkstra (of the Netherlands) had an interesting paper on this subject in the early 1980's. This is what turned me on to this mechanism; alas I have no copy of that paper (nor any of his other interesting papers of that era).

An important omission of this article is the handling of a full/empty circular queue; a viable and straightforward solution is to employ some signal/wait scheme in those circumstances. This of course, uses "locks" of some sort, but the frequency of their use is trivial if the size of the circular queue is well managed.

This lock-less mechanism becomes very valuable when used on the "single" side of a single-producer/multi-consumer queue or its almost twin, the multi-producer, single-consumer queue. As a further valuable adaptation, the "single" partner can be engineered to have (for example) an array of single/single circular lock-less queues for communicating with a multitude of partners, thus effectively eliminating the "multi" aspect of its partner(s); this statement of course ignores a significant amount of implementation details.

It was not always easy to figure out exactly where to employ this mechanism (especially as a retrofit!), but where it was employed it exhibited seriously exponential performance improvements over previously used communication schemes (to a limit of course!) - meaning that as the number of processes on the "multi" side increased (perhaps well into the thousands) and the number of processors increased (at least to about 20 (the biggest I had to work with was 24)) the whole-system throughput increased much better than linearly (because the queues were seldom in the full/empty state; the signal/wait being relatively expensive).

Good Article, Valuable Material, Thanks!!
Sign In·View Thread·PermaLink5.00/5
GeneralRe: A most interesting article...
KjellKod.cc
20:18 16 Nov '09  
Bill Gord wrote:

An important omission of this article is the handling of a full/empty circular queue;



Yes, definitely. In my experience the queue size should be made large enough that the full criterion almost never happen (full being more of a pain than empty). But of course IF they happen they must be dealt with. A few different scenarios are outlined below - it's all depending on the system requirements how the situation should be resolved.


Possible things to do when need to insert, but queue is full

  1. wait till space for insertion (wait: 'inside' the queue)
  2. wait till space for insertion (wait: 'outside' the queue)
  3. by overwriting another item
  4. by throwing away items that were to be inserted, until space is available again
  5. crash hard, since the mechanism doesn't allow for any of the above. Increase queue size a lot and test again



Thanks Bill for your comment Big Grin
I'm happy you liked it.

Kjell Hedström
Sign In·View Thread·PermaLink
GeneralRe: A most interesting article...
Bill Gord
7:06 17 Nov '09  
I like your solutions #3 and #4 Cool If you are in a "debugging mood" they offer endless hours of enjoyment Dead . Especially if somebody else is fielding customer calls Laugh

Seriously, I tend to use #1 but there have been a few interesting situations where #2 was appropriate. For example the situation I mentioned in my first reply where a single "consumer" has an array of single/single queues to its multitude of "producers". In this case the mechanism needs to have a more global (and more challenging) signal/wait scheme: waiting for a message to arrive in any one of the now-empty queues without incurring overhead in the more normal case.

Again... Thanks for the article and the associated references.

Bill
Sign In·View Thread·PermaLink5.00/5
GeneralVolatile misunderstanding
hax_
8:09 7 Nov '09  
>To take care of the cache and compiler re-order issue we make sure that head, tail and the array storage are all volatile. This means that the compiler is not allowed to re-order the head and tail indexes nor the array. That they are volatile also means that any reading of these will be made from memory and not from an old cached value.

There are five things to be considered when designing lockless algorithms:

- atomicity;
- CPU instruction reordering;
- CPU read/write reordering;
- compiler read/write reordering (code rearrangement);
- compiler optimizations;

Atomicity is assumption that operation will not be interrupted, f.e. value of specified size will be written to memory in single transaction, and another CPU core will not see partially updated variable. 32 bit ints are written atomically on x86. One should use InterlockedIncrement() ( lock: inc [eax] in assembly ) to make atomic var++.

CPU instruction reordering is hardware reordering of machine codes to optimize load of multiple execution units (f.e two U and V units on old Pentium). x86 does not allow rearranging memory access instructions.

CPU write reordering can change order in which writes reach memory ( code writes a, then b, b can arrive to memory first ). x86 does not allow read/write reordering.

Compiler read/write reordering: compiler may want to rearrange code, which can result in different order of variables reading/writing.

Compiler optimizations: compiler may want to keep variable in register and update memory only at the end of block.

What 'Volatile' modifier do, which was original used to tag hardware devices mapped as memory, is inform compiler that variable should be:
- read from memory every time it is read in code;
- writeln every time it is writeln in code;
- 'volatile' variable access is also code rearrangement barrier: no lines can be moved up or down through it (you can threat every access to volatile variable as horizontal barrier in code editor ).

Volatile does not warranty cache coherency and write reordering resolve. Cache coherency is actually warrantied by MESI protocol, used on x86:

http://en.wikipedia.org/wiki/MESI_protocol[^]

Here is a nice article for deep understanding of lockless programming problems:

http://msdn.microsoft.com/en-us/library/ee418650(VS.85).aspx[^]

As for performance, I doubt this queue will be any faster than classic Critical section based queue, because it was designed with focus on implementing lockless algorithm - not on performance ( it is just a tricky exercise to write lockless queue ). If you want to get really fast queue - design in performance in mind first. The main problem here is false sharing, of course: head, tail and array[100] are situated in single cache line - and it means cache miss for every write (very slow !). std::list with CS would probably work faster Smile.


Busy waiting is not the best way to burn CPU cycles; consider at least this implementation with proper waiting on full/empty cases:

Fast synchronization between a single producer and single consumer
http://www.bluebytesoftware.com/blog/2009/10/05/FastSynchronizationBetweenASingleProducerAndSingleConsumer.aspx[^]
Sign In·View Thread·PermaLink4.50/5
GeneralRe: Volatile misunderstanding [modified]
KjellKod.cc
22:22 8 Nov '09  
Great reply hax_!
Thank you for clearing up my misunderstanding about cache coherency. When next updating this text I'll be sure to add your input Big Grin

Regarding usage and speed. I stated in the article that my example was made to be simple. I wanted to explain a techique common in some type of C++ software. It really all depends on what it is to be used for and in what situations. I used links to point the reader towards more explanations of how it works and its uses[^]. I never claimed this circular queue is the sh*t[^]to be the used for every possible single producer -> single consumer scenario...


hax_ wrote:
The main problem here is false sharing

Yep. For a multi-core platform where the two threads are on different cores and the head and tail are often in close proximity this would be a big problem. That's why I referred to this in my article (Improvements and Code Notes) [^]. It's all about what the size of the array is, what is actually stored within it (integers, pointers or whole objects), how the items are processed etc... All these and more things considered (+ testing) can give an understanding of how frequent it would be for false sharing to occur. It is impossible to say a general truth about this since it is all into how the queue is used and not so much the queue by itself.

In my example [^] I provide a silly but very simple and (hopefully) easy to follow example that would definitely suffer on a multi-core platform from false sharing with some time consuming cache line thrashing. The example is made to give an understanding of the code + providing tests at the same time and not to give an ultimate solution.

hax_ wrote:
[...]I doubt this queue will be any faster than classic Critical section based queue [...] std::list with CS would probably work faster

A circular queue/buffer has its specific uses and should be considered for those uses. When a circular queue is important to use it would be really silly to use a std::list.
Some examples of uses[^]:
* Queues when having limited memory or no dynamic memory allocation
* Re-use of same memory without need for object 'shuffling' about when objects are consumed
* Memory buffer overwrites if consumer can't keep up etc.
All of these examples are outside the scope of this article

Of course I could've mentioned more about this and explained it all better Unsure but you can always improve on things. Right Wink

--- one last comment ---
hax_ wrote:
Busy waiting is not the best way to burn CPU cycles; consider at least this implementation with proper waiting on full/empty cases:

Where did you get the idea from that I propose that busy waiting is a preferred way of doing things? In my text I simply leave the cases of empty and full to be solved by the producer and the consumer. (full was not in the code but is trivial and explained & suggested[^] to implement if wanted)
---


Thanks again hax_, in spite of my long answer I really did appreciate your feedback.

modified on Monday, November 9, 2009 11:08 AM

Sign In·View Thread·PermaLink
GeneralVersions
KjellKod.cc
21:37 5 Nov '09  
I just noticed that with FireFox browser the pictures are not displayed correctly. It seems fine with Opera, IE 8 and Chrome but not with FireFox for some reason. I'll try to get time later today to fix this.
Sign In·View Thread·PermaLink
GeneralRe: Versions
KjellKod.cc
0:34 9 Nov '09  
I've sent picture/browser fixes to CodeProject hopefully they'll publish them soon.
Sign In·View Thread·PermaLink
GeneralRe: Versions
KjellKod.cc
14:38 12 Nov '09  
12th November, let's see when it make it through CodeProject update queue.

It was a huge re-write to get all the nitty-gritty details that should satisfy even the toughest expert! Pictures also updated so that they now look OK on most (all) browsers.
Sign In·View Thread·PermaLink
GeneralPortability - is a 64 bit platform "exotic"?
peterchen
1:20 5 Nov '09  
as you write
Third, memory updates must not be re-ordered. This is ensured on Intel x86 platforms (and many others). On something more exotic then it should be verified before use.

As I understand AMD64 and IA64 do require memory barriers. Fortunately, Visual Studio 2005 and later will uses acquire/release memory barriers on volatile reads/writes.

volatile - by the standard - is not intended for multiple threads, just for "unusual memory". So the implementaiton does indeed rely on platform and compiler details (as it already does in assuming atomicity of int reads and writes). Just a fair warning.

Do you have any experience/performance measures of what the difference would be between the "volatile+hope for atomicity" solution presented, and using a "proper" mechanism such as InterlockedCompareExchange?

Personally, I love the idea that Raymond spends his nights posting bad regexs to mailing lists under the pseudonym of Jane Smith. He'd be like a super hero, only more nerdy and less useful. [Trevel]
| FoldWithUs! | sighist | µLaunch - program launcher for server core and hyper-v server

Sign In·View Thread·PermaLink3.00/5
GeneralRe: Portability - is a 64 bit platform "exotic"? [modified]
KjellKod.cc
1:38 5 Nov '09  
peterchen wrote:
So the implementaiton does indeed rely on platform and compiler details

Yes definitely. I think I stated that a couple of times.


peterchen wrote:
hope for atomicity

I wouldn't say "hope for atomicity", those are your words. Theres not "hope" here. I'm dealing with facts. The reason why I wrote this text is that this solution or something very similar to it is very much in use today in several industries,. but maybe mostly in the embedded world. It does work but as I mention in the text it's also very much platform and compiler dependent. My experience with this is mostly with the GCC compiler for embedded platforms.

The text deals with problem-solution that is not within the scope of the C++ standard.

I have not made any comparisons with InterlockedCompareExchange. The volatile will cause indexes to be re-read from memory when accessed (this is the whole point). This is slow! But not necessarily slower than InterlockedCompareExchange. ... It would be fun to test the differences.

modified on Friday, November 20, 2009 3:04 AM

Sign In·View Thread·PermaLink
GeneralRe: Portability - is a 64 bit platform "exotic"?
peterchen
2:46 5 Nov '09  
KjellKod.cc wrote:
The volatile will cause indexes to be re-read from memory when accessed (this is the whole point). This is slow!


volatile helps with caching done by the compiler, but it is not guaranteed to help with instruction reordering done by the CPU. And as said, in my understanding that takes place in todays 64 bit platforms. Specifically, comparing head and tail is at risk.

Also as said, volatile does help when compiling on VS 2005.

KjellKod.cc wrote:
hope for atomicity

My problem with these assumptions is that they break softly - a different compiler, a misguided #pragma pack, someone decrementing ESP by an odd amount... Especially for concurrency problems that have very odd patterns of when and how they surface.

One could probably add at least an run time check that the values are well aligned, but I don't see a suitable check for atomicity of access.

It's interesting that you mention embedded. I think these guys have a real advantage - they are much more locked into their platform, and when they have to port, memory barriers are probably their least problem.

Personally, I love the idea that Raymond spends his nights posting bad regexs to mailing lists under the pseudonym of Jane Smith. He'd be like a super hero, only more nerdy and less useful. [Trevel]
| FoldWithUs! | sighist | µLaunch - program launcher for server core and hyper-v server

Sign In·View Thread·PermaLink3.00/5
GeneralRe: Portability - is a 64 bit platform "exotic"? [modified]
KjellKod.cc
3:10 5 Nov '09  
Yes I agree with what you wrote. I hope I made it explicitly clear in the text that it is compiler and platform dependent.

Using memory barriers and InterlockedExchange or similar is also compiler/platform dependant. In fact there is no such thing as platform independent C++ when it comes to multithreading.

Whether or not it breaks in 64-bit platforms remains unclear to me. My previous understanding of this is that 64-bit platforms work similar to 32-bit platforms, and that writing of int would be atomic (if aligned) without any problems with memory reordering. But I haven't had the pleasure to work on them and I won't go on a wild goose chase to find out what's accurate here.

[2009--11-20 Note: I did more research about this and it turns out my initial text was correct. The code works on both x86 and x64 platforms - to mention a few. I updated the article 12th of November to include more about the x64 platforms as well. Compiler/CPU/Memory instruction reordering are all mentioned in the latest update.]

That people should use memory barriers and features such as InterLockedExchange is of course common sense if one is not sure about his/her platform. But my point with the article was to describe this solution that I have seen and why and when it will work. (and when it won't).

modified on Friday, November 20, 2009 3:09 AM

Sign In·View Thread·PermaLink
GeneralRe: Portability - is a 64 bit platform "exotic"?
peterchen
13:02 5 Nov '09  
KjellKod.cc wrote:
Using memory barriers and InterlockedExchange or similar is also compiler/platform dependant. In fact there is no such thing as platform independent C++ when it comes to multithreading.


Important point! (The only advantage is, it complains at compile time).

And yes, your article points out that dependency, only it's a little bit buried, and I'm afraid many people will walk away with the impression that "volatile is enough".

Personally, I love the idea that Raymond spends his nights posting bad regexs to mailing lists under the pseudonym of Jane Smith. He'd be like a super hero, only more nerdy and less useful. [Trevel]
| FoldWithUs! | sighist | µLaunch - program launcher for server core and hyper-v server

Sign In·View Thread·PermaLink1.00/5
Generalvolatile Element array[Capacity]; [modified]
KjellKod.cc
23:45 4 Nov '09  
From supercat9's first reply (see http://www.codeproject.com/Messages/3260356/Re-Where-are-the-atomic-calls.aspx[^])
supercat9 wrote:"
The single-producer single-consumer case can be handled nicely if integer loads and stores are atomic, even if nothing else is.

He is definitely right, in fact I think I erred on the safe side by making the storage volatile. As you already know volatile has nothing to do with making anything atomic only to avoid make sure that all reads/writes are from memory and avoiding of compiler re-ordering and unwanted optimization.
volatile Element array[Capacity]; // should be changed to "Element array[Capacity]"
In fact as I understand it's enough to have the indexes as volatile and not the array. Removing the volatile from the storage is still OK. It cannot be reordered since we use volatile variable when accessing it.

On a side note: Microsofts take on volatile gives it release/aquire semantics which means that it actually works as a re-ordering barrier - but my article and example works just as well on other systems (with already specified restrictions)

With the original code, we might run into trouble when defining our own types since storage of non-volatile Element (most likely) into a volatile array will give you compiler errors. (in C++) A volatile array has all its elements as volatile...
For this reason I would also remove the const in
bool push(const Element& item_);
since that will create problem for your own (non-primitive) types (maybe encapsuled within a smart pointer)

I'll wait for more input for this article but I'll probably update the code here to be something like this
template<typename Element, unsigned int Size>
class CircularFifo {
public:
enum {Capacity = Size+1};

CircularFifo() : tail(0), head(0){}
virtual ~CircularFifo() {}

bool push(Element& item_);
bool pop(Element& item_);
bool empty() const;

private:
volatile unsigned int tail;
Element array[Capacity]; // doesn't need to be volatile
volatile unsigned int head;
unsigned int increment(unsigned int idx_) const;
};

Note that the array is in between tail and head to at least give the possibility of avoiding some false sharing


[2009-11-20 Note: The code was updated on the 12th of November.]

modified on Friday, November 20, 2009 3:11 AM

Sign In·View Thread·PermaLink
GeneralWhere are the atomic calls?
Mauro Leggieri
5:29 4 Nov '09  
Hi,

I cannot imagine a lock-free or wait-free algorithm without using interlocked calls or atomic assembly code.

For example: the main mistake is that you are assuming that this sentence is atomic and it isn't.

idx_ = (idx_+1) % Capacity;

A secondary thread can interrupt after getting idx_+1 and before making the modulus calc.

Volatile keyword make the compiler to always read and write variables from memory but didn't make them atomic.

Best regards,
Mauro H. Leggieri
Sign In·View Thread·PermaLink1.00/5
GeneralRe: Where are the atomic calls? [modified]
KjellKod_cc
5:39 4 Nov '09  
Please read the whole article. The code will not work for other scenarios than one thread producer and one thread consumer.

You are right that it will fail if other threads will interrupt and mess with the 'wrong' index (depending if it's the consumer or producer) - but that's the whole point of the article. This is a maximum two thread scenario.


If you DID mean the case when the producer thread calls increment(tail) and it gets interrupted by the consumer thread that calls increment(head) than that's not a problem. If you check the code you will see that increment is a reentrant function.

For explanation of reentrant please check http://en.wikipedia.org/wiki/Reentrant_(subroutine)[^]

I make no assumption that
idx_ = (idx_+1) % Capacity; 
is atomic. I do make the assumption that assignment of head and tail is atomic on
some platforms. This is done when the increment function returns and not within the increment function as you misunderstood it.

It's the combination of int and volatile that makes the reading/writing of the indexes to be atomic. But did you really read the article? I have a whole section just dedicated to this explanation.

modified on Wednesday, November 4, 2009 11:19 AM

Sign In·View Thread·PermaLink
GeneralRe: Where are the atomic calls?
Mauro Leggieri
7:06 4 Nov '09  
KjellKod_cc wrote:
Please read the whole article. The code will not work for other scenarios than one thread producer and one thread consumer.


Smile Ok

Do you have (or will you post) an example for using multiple threads?

Thanks, Mauro.
Sign In·View Thread·PermaLink3.00/5
GeneralRe: Where are the atomic calls?
KjellKod_cc
13:13 4 Nov '09  
Mauro Leggieri wrote:
Do you have (or will you post) an example for using multiple threads?


At the moment I will not attempt that Hmmm but I can give a pointer. If wanting to do it easy (but maybe not so efficient) you can pretty much use the setup I used in this article but for multiple producers and a single consumer.
The trick is to have each producer using its own fifo and the consumer goes round-robin and takes from them.

----
More: A linked-list version [^]for single-producer single consumer can be found at StackOverflow by Steve Gilham, I would recommend also to read the Herb Sutter articles that he links to (there are two)
Sign In·View Thread·PermaLink
GeneralRe: Where are the atomic calls?
supercat9
6:45 4 Nov '09  
Mauro Leggieri wrote:
I cannot imagine a lock-free or wait-free algorithm without using interlocked calls or atomic assembly code.


The single-producer single-consumer case can be handled nicely if integer loads and stores are atomic, even if nothing else is. As has been noted, using default memory-alignment options, that will naturally be the case on most systems. Even where integer loads and stores are not atomic (e.g. 8-bit systems), there are ways of handling single-write multi-read variables without waits, though the optimal approach will depend upon the possible nature of concurrency (writer interrupting reader, reader interrupting writer, multi-threaded, etc.).

If the queue size is a power of two, one may improve things by keeping the head and tail pointers as "straight" incremented values (without doing a modulus operation on them) and using a bit mask on the variables whenever they are used. This approach will allow the queue to hold one more item than it otherwise could (since head and tail pointers will differ by the queue size).

I personally use a lock-free linked-list queue design using CompareExchange; that approach makes it possible to have multiple producer threads without conflict; I'm not sure whether it's better to do that, or whether it would be better to use locks with a circular array-based queue. Locks are bad if they can be held for unknown duration, but I think they're generally okay if the only time they are ever acquired is for operations which are all but guaranteed to complete quickly (if a thread gets killed asynchronously, of course, there are no guarantees of anything).
Sign In·View Thread·PermaLink5.00/5
GeneralRe: Where are the atomic calls?
Mauro Leggieri
7:00 4 Nov '09  
For almost all instructions, e.g. x86, inc dword ptr [eax];, you must prefix it with a "lock" in order to avoid 2 processors accessing the same area.
Sign In·View Thread·PermaLink2.00/5
GeneralRe: Where are the atomic calls?
supercat9
7:29 4 Nov '09  
Mauro Leggieri wrote:
For almost all instructions, e.g. x86, inc dword ptr [eax];, you must prefix it with a "lock" in order to avoid 2 processors accessing the same area.


Without a LOCK prefix, an "inc" may be decomposed into the following three operations; if the variable does not span a longword boundary, all three operations will be individually atomic.

-1- Read the variable into a private temp register
-2- Increment the private temp register
-3- Store the register to the variable

If multiple threads on separate processors may attempt to write the variable simultaneously, the above sequence could yield undesirable behavior (e.g. processor #1 reads the variable for the INC, processor #2 writes a new value, and processor #1 writes the new value wiping out processor #2's change). In a single-producer single-consumer queue, however, variables may be partitioned into those which will only be written by the producer, and those which will only be written by the consumer. The only problem scenario then would be if one thread were to read a variable while another thread was writing it, such that a partially-updated value was read (e.g. one thread was bumping a counter from 0x12FF to 0x1300 and the other thread read a value of 0x1200 or 0x13FF). On a 32-bit architecture using word-aligned variables, that simply isn't going to happen.
Sign In·View Thread·PermaLink5.00/5
GeneralRe: Where are the atomic calls?
KjellKod_cc
13:00 4 Nov '09  
Thanks for the input supercat9. Very nicely explained Smile

(from your first post)
supercat9 wrote:
I personally use a lock-free linked-list queue design using CompareExchange; that approach makes it possible to have multiple producer threads without conflict

At least Mauro and I would be interested in this, anything you care to share?

Once again, thanks for your replies.
Regards
/ Kjell
Sign In·View Thread·PermaLink
GeneralRe: Where are the atomic calls?
supercat9
6:33 5 Nov '09  
KjellKod_cc wrote:

At least Mauro and I would be interested in this, anything you care to share?


The code's in vb, and its fundamental design relies upon .net garbage collection, so I don't know that it would be helpful if you're programming in C++.

Actually, my lock-free queue was one of my earlier efforts, and if I were going to post it I would want to clean it up a bit first. A design aspect I thought was interesting and is sometimes helpful, but was in retrospect a bad idea, is that it uses separate "queue" and "queue reader" objects. Data that's put into a queue for which no readers exist will simply be discarded; otherwise, data that's added to the queue will be made available to any readers that are attached. The queue itself only holds a reference to the most recent item added, and has no clue what readers exist. Each reader holds a reference to the item before the last one read, and each item holds a reference to the next one.

In some ways it's elegant that items become eligible for garbage collection when there are no longer any readers that are interested in them, and that this occurs without the queue object having to know anything about the readers that exist. In retrospect, though, it was a bad idea since a reader which is created and then never read from will cause its associated queue to hold data forever until memory fills up. Since the queue object has neither any clue about what readers exist, nor any sort of reference to any older items, it has no way to enforce a queue size limit. It may be possible to have the queue object keep WeakReferences to some older queue entries and, if they still exist after enough newer entries are added, invalidate their links. Maintaining thread safety while doing this would be tricky, however, and in practice the multi-readers feature really hasn't been all that useful.
Sign In·View Thread·PermaLink2.00/5


Last Updated 14 Nov 2009 | Advertise | Privacy | Terms of Use | Copyright © CodeProject, 1999-2009