|
|
Comments and Discussions
|
|
 |
|

|
this article is excellent because i found solutions for more questions on thread.
|
|
|
|

|
Thank you. I am glad it + forum was helpful
|
|
|
|

|
There is a simple way to handle multiple producers and consumers using "lock" at the right places in the code. Thanks, Jorge Orejel.
|
|
|
|

|
Definitely! I would even go so far as to say that for most coders a lock-free structure is probably never of interest. For most code a sensible use of locks will do the trick and the bottlenecks will likely be elsewhere,... for most code, but not for all code.
For lock-based approach with multiple clients I like Anthony Williams approach in his book "C++ Concurrency In Action". I use it for the internal message queue in an asynchronous, "crash safe" logger g2log[^]. A decent approach where the locks are hidden internally in the structure to avoid common problems involved with external locking. Here is a sample[^]
|
|
|
|

|
Is it possible to get such code anywhere? Even multiple producer/single consumer would be already awesome. scenario: library logging. Thanks so much for sharing, cheers
|
|
|
|

|
Also, I wonder if in order to allow overwriting for very slow consumers that do not keep up with the producers rate will be safe to just avoid branching i.e. memory_relaxed_aquire_release.hpp:52 and incrementing also the head as one location has been overwritten. thanks!
|
|
|
|

|
I am a little confused by your comment as it was made in reply my comment where I pointed out exactly where you can find such code.
Yes. Multiple producers sending log entries to a log consumer is very easy to implement if the message queue is protected by mutex. I used exactly that (see earlier answer) in my asynchronous, "crash safe" logger g2log[^].
Or are you asking about lock-free multiple producer queue? Sure, that is possible as well. But are you sure you actually need lock-free?
Scenario 0: Just the protected shared_queue pointed to in my earlier answer.
Scenario 1: Multiple producers are known in advance. Each can then have their own lock-free "single-producer, single consumer" queue (a la the one in this article). The consumer can be the same thread just serving the producer-queues in a round robin manner
Scenario 2: Use a dedicated multiple producer queue. There are several examples out there. Google for it, check here at CodeProject (here for instance[^])
Scenario 3: Check out the "Disruptor" java pattern. There exist a C++ implementation as well. I am playing with the idea of making something similar for g2log.
|
|
|
|

|
Thanks for answering, I didnt get any email notification.
Yeah, I need lock free as one of the producing thread is audio real-time, and I can't lock a mutex on that. Looks like the implementation of scenario 2 is good for my needs! will check it out now, thank you in the meantime
|
|
|
|

|
It seems to me that there is a closing bracket missing (2 times in this article):
while(!y.load(std::memory_order_aquire);
|
|
|
|

|
Thank you. I will fix that when I do the correction
|
|
|
|

|
A few typos detected,. I will wait a few days and then correct it.
* First "push()" function != turned to == in the article text. The source code is correct
* the the
* another another
* aquire --> acquire (3am writing gives me creative spelling)
* some broken links in the beginning
Fixed!
modified 2 Dec '12 - 11:52.
|
|
|
|
|

|
Thank you. Do come back in a couple of days. A complete re-write of the article is pending. From platform hack to C++11 guaranteed.
|
|
|
|

|
Hi,
Thanks for the very nice article. I believe the circular queue will only work between Windows threads. But I would like to use this to pass data between two Windows processes and use shared memory. Is this possible? Do you have any suggestions on how to accomplish this?
Thank you!
|
|
|
|

|
Hi,
You are more then right. I tested this originally on x64 and x86 both for Windows (VS) and Linux (gcc). Last week I ran tests for this Queue and I can now easily break it on Linux (x86+x64).
I am putting up a huge disclaimer on this article shortly and I will instead point out how to do this in a guaranteed way both on Linux and Windows.
The easy solution however is to just throw out old volatile code and use the C++11 default sequential atomics instead. Then it is almost no code changes apart from the volatile to atomic . It is of course really is an enormous difference behind the scene.
namespace memory_sequential_consistent {
template<typename Element, size_t Size>
class CircularFifo{
public:
enum { Capacity = Size+1 };
CircularFifo() : _tail(0), _head(0){}
virtual ~CircularFifo() {}
bool push(const Element& item); bool pop(Element& item);
bool wasEmpty() const;
bool wasFull() const;
bool isLockFree() const;
private:
size_t increment(size_t idx) const;
std::atomic <size_t> _tail; Element _array[Capacity];
std::atomic<size_t> _head; };
The implementation for the sequential consistent atomic solution above is very straight forward.
template<typename Element, size_t Size>
bool CircularFifo<Element, Size>::push(const Element& item)
{
const auto current_tail = _tail.load();
const auto next_tail = increment(current_tail);
if(next_tail != _head.load())
{
_array[current_tail] = item;
_tail.store(next_tail);
return true;
}
return false; }
template<typename Element, size_t Size>
bool CircularFifo<Element, Size>::pop(Element& item)
{
const auto current_head = _head.load();
if(current_head == _tail.load())
return false;
item = _array[current_head];
_head.store(increment(current_head));
return true;
}
template<typename Element, size_t Size>
bool CircularFifo<Element, Size>::wasEmpty() const
{
return (_head.load() == _tail.load());
}
template<typename Element, size_t Size>
bool CircularFifo<Element, Size>::wasFull() const
{
const auto next_tail = increment(_tail.load());
return (next_tail == _head.load());
}
template<typename Element, size_t Size>
bool CircularFifo<Element, Size>::isLockFree() const
{
return (_tail.is_lock_free() && _head.is_lock_free());
}
template<typename Element, size_t Size>
size_t CircularFifo<Element, Size>::increment(size_t idx) const
{
return (idx + 1) % Capacity;
}
To Verify Broken Queue and C++11 Corrected Queue
Last, to easily verify that in fact this article's code is just as dangerous as I tried to show. It can be tested on Windows (VS2012) and Linux (gcc). At least on Linux with gcc it now breaks as proven by a simple test. Below is a snapshot of a unit test that I ran that can show this.
Some functionality is not here but you get the gist I am sure. You can replace whatever's missing easily for something equivalent.
using namespace memory_sequential_consistent;
typedef unsigned int Number;
namespace {
const std::size_t k_queue_size = 1000;
const std::size_t k_amount_to_process = 10*1000*1000; const Number k_lower_limit = 0;
const Number k_upper_limit = 10000;
const std::string queue_type_message = "CircularFifo_SEQUENTIAL (default sequential) ";
typedef Number Message;
typedef CircularFifo<Message, k_queue_size> MessageQueue;
}
std::vector<Message> Producer(MessageQueue& out_fifo, const size_t enough_processed)
{
std::vector<Message> all_data;
all_data.resize(enough_processed);
replaceContentWithRandoms(all_data, k_lower_limit, k_upper_limit);
std::vector<Message> pushed_data; for(Message m : all_data)
{
while(false == out_fifo.push(m))
{
std::this_thread::yield();
}
pushed_data.push_back(m);
}
return pushed_data;
}
std::vector<Message> Consumer(MessageQueue& in_fifo, const size_t enough_processed)
{
std::vector<Message> received_data; for(size_t count = 0; enough_processed > count; ++count)
{
Message m;
while(false == in_fifo.pop(m))
{
std::this_thread::yield();
}
received_data.push_back(m);
}
return received_data;
}
TEST(ThreadedTest, SingleProducerSingleConsumer)
{
std::cout << queue_type_message << "\nPrepare to wait a bit, pushing ";
std::cout << k_amount_to_process << " through a circular fifo of size: ";
std::cout << MessageQueue::Capacity-1 << std::flush;
MessageQueue msg_queue;
g2::StopWatch clock;
auto future_consumed =
std::async(std::launch::async, &Consumer, std::ref(msg_queue), k_amount_to_process);
auto producer_made = Producer(std::ref(msg_queue), k_amount_to_process); auto consumer_received = future_consumed.get(); auto elapsed_ms = clock.elapsedMs();
std::cout << "\n\t#" << k_amount_to_process << " items took #" << elapsed_ms.count();
std::cout << " milliseconds to shuffle" << std::flush;
ASSERT_TRUE(consumer_received.size() == producer_made.size());
ASSERT_TRUE(consumer_received.size() == k_amount_to_process);
auto getitr = consumer_received.begin();
auto putitr = producer_made.begin();
auto idx = 0;
while(getitr != consumer_received.end())
{
if((*getitr) != (*putitr))
{
std::cout << "\nC:" << *getitr << "\tP:" << *putitr << "\tidx:" << idx;
}
++idx;
++getitr;
++putitr;
}
ASSERT_TRUE(std::equal(producer_made.begin(), producer_made.end(), consumer_received.begin()));
}
modified 26 Nov '12 - 4:35.
|
|
|
|

|
Hello,
Thank you for a great article.
I would like to use this kind of structure as a buffer to push and pull text.
Therefore I would rather use a memcpy and then increment the index by the length of the string all in one call rather than push and pop character by character. I would check if there is enough space prior to this of course.
Would doing this break any assumptions?
I don't think it would, but then again I'm new to this topic...
Thank you
|
|
|
|

|
Hi a_kavo
What you describe seems similar to the circular buffer I first saw and which gave idea to this article, to investigate whether or not such code would work. You can see what I wrote about that "initial queue" here[^]. So "yes" such code can work, but I would do you a disservice to say that it definitely would. It all "depends"
{ queue implementation : hardware architecture : compiler }
Some questions back to you:
1. What platform have you planned to use this on? This code, as is, could break on some platforms.
2. Do you have access to C++11 std::atomic, i.e visual studio 11 (2012), gcc 4.7 or the C++11 library implementation by just::thread?
If you can say yes to any of these questions then the head and tail should take advantage of that instead of how it is in the current implementation
* std::atomic only guarantees atomicity not lock-free. If it is lock-free (platform dependent) is easy to find out, I belive the call std::atomic::is_lock_free() or similar would answer that for your platform.
--- So bottom line: "Maybe", but you should look into using C++11 features instead.
|
|
|
|

|
Hello,
I ended up implementing the interface I mentioned in my question and wanted to post it here for the record in case anyone finds it useful.
What I was looking for was an interface similar to socket send()/recv() which could allow for batching of elements or for the use of the ring buffer as a message queue.
Edited/fixed bugs 4/3/13
template<typename Element, size_t Size>
size_t RingBuffer<Element, Size>::push(const Element* data, size_t len)
{
size_t current_tail = _tail.load(std::memory_order_relaxed);
size_t sz_left = size_left_p_();
size_t sz_to_end = Capacity - current_tail;
size_t sz_to_push = std::min(len, sz_left);
sz_to_push = std::min(sz_to_push, sz_to_end);
memcpy(&_array[current_tail], data, sz_to_push*sizeof(Element));
current_tail = increment(current_tail, sz_to_push);
_tail.store(current_tail, std::memory_order_release);
return sz_to_push;
}
template<typename Element, size_t Size>
size_t RingBuffer<Element, Size>::pop(Element* data, size_t len)
{
size_t current_head = _head.load(std::memory_order_relaxed);
size_t sz_used = size_c_();
size_t sz_to_end = Capacity - current_head;
size_t sz_to_pop = std::min(len, sz_used);
sz_to_pop = std::min(sz_to_pop, sz_to_end);
memcpy(data, &_array[current_head], sz_to_pop*sizeof(Element));
current_head = increment(current_head, sz_to_pop);
_head.store(current_head, std::memory_order_release);
return sz_to_pop;
}
template<typename Element, size_t Size>
size_t RingBuffer<Element, Size>::size_c_() const
{
size_t curTail = _tail.load(std::memory_order_acquire);
size_t curHead = _head.load(std::memory_order_relaxed);
if(curTail >= curHead)
{
return curTail - curHead;
}
return Capacity - curHead + curTail;
}
template<typename Element, size_t Size>
size_t RingBuffer<Element, Size>::size_left_p_() const
{
size_t curTail = _tail.load(std::memory_order_relaxed);
size_t curHead = _head.load(std::memory_order_acquire);
if(curTail >= curHead)
{
return Size - curTail + curHead;
}
return curHead - curTail - 1;
}
template<typename Element, size_t Size>
inline size_t RingBuffer<Element, Size>::increment(size_t idx, size_t step) const
{
return (idx+step) % Capacity;
}
I believe the use of memory barriers above is correct and C++11 takes the pain away of making such code portable by abstracting the memory models of various architectures.
By using the above interface, throughput is increased as the cost of the most expensive operations like
_head.store(current_head, std::memory_order_release);
is amortized.
Thanks
modified 3 Apr '13 - 10:59.
|
|
|
|
|

|
if there are many threads write FIFO and only one thread read, but no concurrent write at same time, such as:
thread 0 read for ever
thread 1 write to fifo
thread 2 write to fifo
...
question: in this case, the FIFO is thread safe or not? why?
|
|
|
|

|
As it is now it is not thread-safe for multiple producers.
bool CircularFifo<Element, Size>::push(Element& item_)
{
int nextTail = increment(tail);
if(nextTail != head)
{
array[tail] = item_;
tail = nextTail;
[...]
As you see above. If any other thread would come in then it would be à mess.
One example: threadA gets value X for nextTail, threadB gets value X+1 for nextTail. ThreadB runs the function to completion and sets tail = X+1. ThreadA continues and sets tail to X. ThreadC comes in and sets tail to X+1.
|
|
|
|

|
i'm a bit confused. let's say there's a thread pool to produce message, and there's a lock arround CircularFifo::push:
{
Lock lk;
fifo.push(item);
}
multiple threads no way to do concurrent push on FIFO, is this NOT thread-safe?
|
|
|
|

|
Sorry, I got confused by your question.
Since it is a lock-free queue putting a lock on top of it seems out-of-place to me. I think it is better not to mix them lock-free + lock together. If you need a locked-shared queue then you should strongly consider keeping the lock inside the queue - this avoids potential queue/thread misuse.
If you need a shared_queue that many producers, even many consumers can use simultaneously then I can suggest a couple of examples.
1. A lock protected shared_queue inspired by Anthony Williams book "C++ Concurrency in Action". You can see examples of hit here https://bitbucket.org/KjellKod/g2log/src/d4518b302cc8/g2log/src/shared_queue.h
[^] and here http://www.codeproject.com/script/Articles/ViewDownloads.aspx?aid=288827
There are lock-free multiple producer queues here at CodeProject,http://www.codeproject.com/KB/threads/arraylockfreequeue.aspx?msg=3769640#xx3769640xx I have not tried them but if it is lock-free you are after why not check out
|
|
|
|
|
|

|
Thank you Manoj.
I will try to make an update soon (well in 2012),. or maybe a follow-up would be better. To show how this simplistic lock-free queue will benefit from C++11
|
|
|
|

|
in order to choose best size of the queue, we need to know the size used by the queue.
q1: how to get the size used by the queue?
q2: is this thread safe ?
modified 26 Sep '11 - 8:13.
|
|
|
|

|
Member 2021086 wrote: q2: is this thread safe ?
Well, the whole article is to describe when it is thread safe and when it's not. In short, if only two threads (one consumer, one producer) are involved and your platform is the right one then yes.
HOWEVER. I wrote this as a proof-of-concept queue and explains why it works and when it doesn't. I actually suggest you use another queue but this one. So instead of answering your "q1". I give back questions for you to ponder
1. Why lock-free? Unless you absolutely need to why not just use a normal std::deque or std::queue and wrap it and making the access of it that way thread safe? An example of this that I've used many times is Anthony Williams suggestion for a thread safe queue
2. If you absolutely need a lock-free queue, then go for a queue that was not made proof-of-concept but for performance. Look at this message comment on my article, and follow its link
3. Last. If all you want is a circular queue. Your platform is right, etc then yes, you can go with something like this or something similar but more optimized. Koders.com is a good source for code. Dr.Dobbs I believe have a few articles on the subject as well.
|
|
|
|

|
thanks for the reply.
KjellKod.cc wrote: Well, the whole article is to describe when it is thread safe and when it's not. In short, if only two threads (one consumer, one producer) are involved and your platform is the right one then yes.
I read about it is thread safe if only two threads(one read, one write) on x86 or x64 platform.
my question actually is "is this queue has a thread safe (only one read, one write)method called 'size()' to make sure how many elements pushed in the queue?'
I need to choose RIGHT capacity (large enough that push always succed but not larger to waste memory) of the queue, in this case, i need to know the size of the queue (call size() funcion in read thread or write thread but not affect the queue).
KjellKod.cc wrote: 1. Why lock-free? Unless you absolutely need to why not just use a normal std::deque or std::queue and wrap it and making the access of it that way thread safe? An example of this that I've used many times is Anthony Williams suggestion for a thread safe queue
I actualy done this already, I want test the lock free ring buffer for performence and memory fragmentation.
KjellKod.cc wrote: 2. If you absolutely need a lock-free queue, then go for a queue that was not made proof-of-concept but for performance. Look at this message comment on my article, and follow its link
3. Last. If all you want is a circular queue. Your platform is right, etc then yes, you can go with something like this or something similar but more optimized. Koders.com is a good source for code. Dr.Dobbs I believe have a few articles on the subject as well.
yes, I want a lock free circualar buffer. in my case, there's two thread model:
1. one thread read, one thread write, this fits the queue's requirements. (x86 or x64)
2. one thread read, many thread write. this can be done by split multi circular buffer.
for example, one of many write thread has its own circular buffer to write. this also fits the queue's requirements. (x86 or x64)
I want to test how many performence improvment when lock free circluar buffer used.
thanks again.
|
|
|
|

|
Hi,
First you saw that I posted the wrong link the first time? The correction:
KjellKod.cc wrote: Sorry. I pointed you to the wrong link. The correct one is this. Yet another implementation of a lock-free circular array queue by Faustino Frechilla
Then let's answer your questions.
Member 2021086 wrote: my question actually is "is this queue has a thread safe (only one read, one write)method called 'size()' to make sure how many elements pushed in the queue?'
I used to have a size function. But I removed it from this code so as only to show the bare essentials. Would such a size() function be thread safe?. Well Yes and No. Yes because it would calculate the size at that given time accurately, however since you (might) have one thread pushing and another thread removing items from the queue by the time you read the value the accurate size might have changed. As a test function or as a snapshot of what-size-that-was you can use such as function, as long as you're aware that the size might be different when you read the value.
From the top of my head, without having tested it (I'm not at a workstation right now) the formula should be
void size_t size() const
{
if(tail >= head)
{
return tail - head;
}
else
{
return Capacity - head + tail;
}
}
Or if you want to make it "short"
void size_t size() const
{ return (tail >= head ? tail: tail + Capacity) - head; }
Member 2021086 wrote: I need to choose RIGHT capacity (large enough that push always succed but not larger to waste memory) of the queue
Yes, this is the downside of using a circular queue. My recommendation is as follows.
1. Hack the code and insert the size() function. Every time you make a push you can check inside push() what the size was. Use it as a "high watermark" remainder. Every time the size() is larger than it ever was before you store that (otherwise ignore it). Then in the destructor you just print out what the high watermark came to. Use a ridiculous large queue to start with, just for testing, and really stress test it!
At a previous project we used 2 * high_watermark just because it's more pain to deal with a full queue than the extra space we used. In fact, such a "high watermark" is probably good to keep. You can log whenever it get over a "warning limit". This way you can tune it easy as you go.
Long answer. I hope it helped.
|
|
|
|

|
thanks for the answer.
KjellKod.cc wrote: First you saw that I posted the wrong link the first time? The correction:
KjellKod.cc wrote: Sorry. I pointed you to the wrong link. The correct one is this. Yet another implementation of a lock-free circular array queue by Faustino Frechilla
yes. I read about it. but I think this code is better for me in my case, since it simple enough to understand, it has some limits, but with carefully desgin, code, we can accept these limits, even more memory to waste, since it can avoid memory fragmentation.
KjellKod.cc wrote: 1. Hack the code and insert the size() function. Every time you make a push you can check inside push() what the size was. Use it as a "high watermark" remainder. Every time the size() is larger than it ever was before you store that (otherwise ignore it). Then in the destructor you just print out what the high watermark came to. Use a ridiculous large queue to start with, just for testing, and really stress test it!
At a previous project we used 2 * high_watermark just because it's more pain to deal with a full queue than the extra space we used. In fact, such a "high watermark" is probably good to keep. You can log whenever it get over a "warning limit". This way you can tune it easy as you go.
now I know the size() only useful when push() the elements, the high watermark way is help a lot to me.
thanks again. and great job!
|
|
|
|
|

|
"If more threads are wanted, then a different solution should be used, i.e. queue with write/read locks" - but we can create multiple queues and push the same item simultaneously into all of them!
|
|
|
|

|
True. It's also possible with other types of lock free queues
See this for example
http://www.codeproject.com/KB/threads/MultiThreadSimpleQueue.aspx
But it's easier with read/write locks so unless the lock-free solution is tried and proven to be good AND much more effective than the lock solution (based on the specific instances need) than my recommendation is to go with read/lock.
|
|
|
|

|
Thanks, but I don't need solution right now. I'll experimenting, and maybe I'll publish another code.
|
|
|
|

|
"If more threads are wanted, then a different solution should be used, i.e. queue with write/read locks" - but we can create multiple queues and push the same item simultaneously into all of them!
|
|
|
|

|
Nice article, good balance between theory and practice, nice to see the resulting class has real-world utility. GM5.
Jerry.
|
|
|
|

|
Hello,
as it is a matter strictly related with the subject of your article,
I'd like to point you to the article I wrote a while ago on the subject
of an unbounded multithreaded "Standard" queue.
I've to say that I was motivated in part also by your work, and that I point
to your article for the simple circular buffer.
My article link is :
A Standard Multi-threaded Dynamic Queue
In fact, I cite as sources for the description of the "standard" queue, the following:
1) The textbook: "C++ Concurrency in Action - Practical Multithreading" by Anthony Williams.
2) The article, that you may find on the web, by M. Michael and M. Scott : "Nonblocking algorithms and preemption-safe locking on multiprogrammed shared - memory multiprocessors."
Journal of Parallel and Distributed Computing, 51(1):1-26, 1998.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.1892&rep=rep1&type=pdf
In fact, in my article I present ONLY the "standard" unbounded multithreaded queue (FIFO) using locks that is suitable for multiple producers / consumers, this "standard" queue is used for comparison purposes in the reference (2) above.
BUT what you may find much more interesting is the non-blocking algorithm presented in reference (2) above (that is using CAS instructions). In that paper M. Michael and M. Scott do present a non-blocking algorithm for multiple producers / consumers that is, obviously, an enhancement compared to the simpler circular buffer if memory limitations are not an issue. This non-blocking algorithm is compared to the "standard" queue that I present in my article. It is shown in the paper by M. Michael and M. Scott what are the benefits in performance by using the non-blocking algorithm vs. the standard queue using locks.
I think you will find such a non-blocking algorithm of interest as it removes the limitations of a sigle-producer/single-consumer of the simpler circular buffer.
Cheers
Federico Strati
|
|
|
|

|
Excellent job! Good solution and demonstration on single producer-sonsumer context.
|
|
|
|

|
First of all, I have to say it is a great job in this area. Great job!
About the picture to illustrate a full queue, where the TAIL pointer might point to an empty item, but in the picture, it points to item7. My misunderstanding or it's a little mistake on full queue? At this time, if producer make more item, and it will insert into the position and make TAIL pointer at same position with HEAD pointer. So the TAIL pointer will always keep empty, right?
If I am wrong, please tell. Sorry for any mis-understanding on this great job!
The world is fine.
|
|
|
|

|
You're completely correct. Of course the "item 7" should be blank. Sorry for the confusion.
-- Thanks for pointing it out. I actually caught another typo thanks to this. In the text I wrote
Author wrote: When the buffer is full, there will be a one slot difference between head and tail. At this point any writes by the consumer will fail or else the write index would be the same as the read index and two bad things would happen
Of course the "consumer" is a typo. The consumer pops/consumes items, it doesn't write/push to the queue
So the correct sentence should be:
Author wrote: When the buffer is full, there will be a one slot difference between head and tail. At this point any writes by the producer will fail or else the write index would be the same as the read index and two bad things would happen
When I get time for it I will update the picture and the text! Thankfully the source code is correct
|
|
|
|

|
Thanks for the feedback. It took a long time for me to get around to fix it, but now I've sent in an article update to The Code Project. I don't know how fast it'll be updated but you can always see the latest version here www.kjellkod.cc/threadsafecircularqueue[^]
|
|
|
|
|

|
Very good introduction
to one of the main tools
in real time programming
techniques.
|
|
|
|

|
Could just be that I found this just at the time I needed it but it explains the issues clearly and demonstrates a workable solution simply enough that I feel confident to implement it in my code and still be able to fathom it out in a year or two when I need to update it.
Many thanks.Dave Cross
|
|
|
|

|
I have been using such lock-less circular queues in mainframe system software since about 1985. While the details are obviously quite different (I was not using C++, an X86, nor Windows), I will absolutely state that the mechanism is viable and valuable. The systems involved had from 1 to 24 processors (and perhaps even more in the recent years), with similar (but slightly different, obviously) atomicity, synchronicity and caching issues. I recall that professor E.W.Dijkstra (of the Netherlands) had an interesting paper on this subject in the early 1980's. This is what turned me on to this mechanism; alas I have no copy of that paper (nor any of his other interesting papers of that era).
An important omission of this article is the handling of a full/empty circular queue; a viable and straightforward solution is to employ some signal/wait scheme in those circumstances. This of course, uses "locks" of some sort, but the frequency of their use is trivial if the size of the circular queue is well managed.
This lock-less mechanism becomes very valuable when used on the "single" side of a single-producer/multi-consumer queue or its almost twin, the multi-producer, single-consumer queue. As a further valuable adaptation, the "single" partner can be engineered to have (for example) an array of single/single circular lock-less queues for communicating with a multitude of partners, thus effectively eliminating the "multi" aspect of its partner(s); this statement of course ignores a significant amount of implementation details.
It was not always easy to figure out exactly where to employ this mechanism (especially as a retrofit!), but where it was employed it exhibited seriously exponential performance improvements over previously used communication schemes (to a limit of course!) - meaning that as the number of processes on the "multi" side increased (perhaps well into the thousands) and the number of processors increased (at least to about 20 (the biggest I had to work with was 24)) the whole-system throughput increased much better than linearly (because the queues were seldom in the full/empty state; the signal/wait being relatively expensive).
Good Article, Valuable Material, Thanks!!
|
|
|
|

|
Bill Gord wrote:
An important omission of this article is the handling of a full/empty circular queue;
Yes, definitely. In my experience the queue size should be made large enough that the full criterion almost never happen (full being more of a pain than empty). But of course IF they happen they must be dealt with. A few different scenarios are outlined below - it's all depending on the system requirements how the situation should be resolved.
Possible things to do when need to insert, but queue is full
- wait till space for insertion (wait: 'inside' the queue)
- wait till space for insertion (wait: 'outside' the queue)
- by overwriting another item
- by throwing away items that were to be inserted, until space is available again
- crash hard, since the mechanism doesn't allow for any of the above. Increase queue size a lot and test again
Thanks Bill for your comment
I'm happy you liked it.
Kjell Hedström
|
|
|
|
|

|
>To take care of the cache and compiler re-order issue we make sure that head, tail and the array storage are all volatile. This means that the compiler is not allowed to re-order the head and tail indexes nor the array. That they are volatile also means that any reading of these will be made from memory and not from an old cached value.
There are five things to be considered when designing lockless algorithms:
- atomicity;
- CPU instruction reordering;
- CPU read/write reordering;
- compiler read/write reordering (code rearrangement);
- compiler optimizations;
Atomicity is assumption that operation will not be interrupted, f.e. value of specified size will be written to memory in single transaction, and another CPU core will not see partially updated variable. 32 bit ints are written atomically on x86. One should use InterlockedIncrement() ( lock: inc [eax] in assembly ) to make atomic var++.
CPU instruction reordering is hardware reordering of machine codes to optimize load of multiple execution units (f.e two U and V units on old Pentium). x86 does not allow rearranging memory access instructions.
CPU write reordering can change order in which writes reach memory ( code writes a, then b, b can arrive to memory first ). x86 does not allow read/write reordering.
Compiler read/write reordering: compiler may want to rearrange code, which can result in different order of variables reading/writing.
Compiler optimizations: compiler may want to keep variable in register and update memory only at the end of block.
What 'Volatile' modifier do, which was original used to tag hardware devices mapped as memory, is inform compiler that variable should be:
- read from memory every time it is read in code;
- writeln every time it is writeln in code;
- 'volatile' variable access is also code rearrangement barrier: no lines can be moved up or down through it (you can threat every access to volatile variable as horizontal barrier in code editor ).
Volatile does not warranty cache coherency and write reordering resolve. Cache coherency is actually warrantied by MESI protocol, used on x86:
http://en.wikipedia.org/wiki/MESI_protocol[^]
Here is a nice article for deep understanding of lockless programming problems:
http://msdn.microsoft.com/en-us/library/ee418650(VS.85).aspx[^]
As for performance, I doubt this queue will be any faster than classic Critical section based queue, because it was designed with focus on implementing lockless algorithm - not on performance ( it is just a tricky exercise to write lockless queue ). If you want to get really fast queue - design in performance in mind first. The main problem here is false sharing, of course: head, tail and array[100] are situated in single cache line - and it means cache miss for every write (very slow !). std::list with CS would probably work faster .
Busy waiting is not the best way to burn CPU cycles; consider at least this implementation with proper waiting on full/empty cases:
Fast synchronization between a single producer and single consumer
http://www.bluebytesoftware.com/blog/2009/10/05/FastSynchronizationBetweenASingleProducerAndSingleConsumer.aspx[^]
|
|
|
|
 |
|
|
General News Suggestion Question Bug Answer Joke Rant Admin
|
How to make a wait-free, lock-free CircularFifo using C++11
| Type | Article |
| Licence | Public Domain |
| First Posted | 4 Nov 2009 |
| Views | 79,935 |
| Downloads | 2,374 |
| Bookmarked | 117 times |
|
|