Click here to Skip to main content
Click here to Skip to main content

A High Performance Lock Free Ring Queue

, 3 Apr 2014 CPOL
Rate this:
Please Sign up or sign in to vote.
A high performance queue comparable to Boost lock free queue

Introduction

In this tip, I coded a high performance lock free ring queue in C++11, with a million data items enqueued and dequeued in about 0.16 second, while famous Boost library lock free queue takes about 0.24 second to finish the exact same actions, under cygwin unix-windows operating system, using gcc 4.8 compiler, in a duel core Intel i7- 3520M@2.9GHz, 2.9GHz machine.

Background

Improving the performance of a software can use multithreading nowadays, when most computers sold now have more than one core. However multithreading is not an easy task, and developers need face synchronization problem, and could end up undefined result and/or worse performance.

The blocking synchronization mechanism was developed first using mutex, semaphore, etc. The blocking synchronization works well with this type of the applications, which are idle most time, waiting for some events to come in most time from some threads. However, the performance of this synchronization gets worse when the idle time is less, because operating system must preemptive the process more, therefore more context switch, and cache thrashing. In addition, the performance suffers more when the number of core increases with more concurrent running thread running on more core, so contending more, etc.

Because of the shortcomings of the blocking synchronization mechanism, lock free non blocking synchronization mechanism was developed. The famous Boost library developed lock free data structures.[1] There are some articles on CodeProject which developed lock free structures[2][3]. Lock free synchronization mechanism decrease the frequency in which processes are preempted, reducing, therefore, cache trashing. In addition, contention between threads drops considerably since there is no lock to protect any data. Threads basically claim space first and occupy with data (including pointers to the data). Therefore, moving the hottest data structure from locked to lock free design can improve software performance significantly.

One of the most popular data structures is the queue, which could have hundreds of thousands of enqueue and dequeue operations per second from tens of producers and/or consumers from different threads. The performance of the queue is measured as the throughput it can handle, specifically, number of data enqueued and dequeued per second.

Implementation

I implement the queue using an array storing pointers to the data items, and the memories of data item are pre allocated and item objects are created before the queue actions. It has good performance especially comparing to store copies of data items, since it does not have the cost of run time heap memory allocation. Runtime memory allocation is not only costly since the heap memory fragmentation problem, but also dangerous under multithreading since potential over allocation leading memory leak.

The array in my queue is a fixed size array in order to achieve best performance. To solve the problem caused by fixed size array, I make the index of the array come back to 0 after reaching the end the array. That is why it is called ring queue. The lock free enqueue is achieved by atomic operation compare and swap(CAS) on enqueue index. The lock free dequeue is achieved by CAS on dequeue index. Some atomic variables are used to synchronize between enqueue and dequeue. Let us take a look at the data member in my lock free ring queue, and the comments I put in explains the purpose of each data member.

template < typename _TyData, long _uiCount = 100000 >
class lfringqueue
{
private:
    std::atomic<long> m_lHeadIterator;  // enqueue index
    std::atomic<long> m_lTailIterator;  // dequeue index
       _TyData **m_queue;                  // array of pointers to the data
    long m_uiCount;                     // size of the array
    std::atomic_flag m_lEventSet;       // a flag to use whether we should change the item flag
        std::atomic_flag m_bHasItem;        // a flag to indicate whether there is an item enqueued
};

The key functions in the queue are enqueue and dequeue function. Please read those and comment if there are any.

Test and Compare with Boost Lock Free Queue

The test is the simulation of real financial trading world situation. A random price comes in from a thread, and we need look up all financial instruments to find the pointer to that particular object, storing the information related to that instrument, and then enqueue the pointer to the queue. Different threads dequeue items from the queue to process the price updates. Here is the main function in the test.

int main(int argc, char** argv) 
{
    for ( int i = 1; i < NUM_DATA + 1; i++ )
    {
        Data data(i);
        DataArray[i-1] = data;
    }
     
    std::thread PublishThread[NUM_ENQUEUE_THREAD]; 
    std::thread ConsumerThread[NUM_DEQUEUE_THREAD];
    std::chrono::duration<double> elapsed_seconds;
  
    for ( int i = 0; i < NUM_ENQUEUE_THREAD;  i++ )
    {
        PublishThread[i] = std::thread( GenerateRandomNumber_FindPointerToTheNumber_EnQueue); 
    }
    
    auto start = std::chrono::high_resolution_clock::now();
    for ( int i = 0; i < NUM_DEQUEUE_THREAD; i++ )
    {
        ConsumerThread[i] = std::thread{ Dequeue};
    }
    
    for ( int i = 0; i < NUM_DEQUEUE_THREAD; i++ )
    {
        ConsumerThread[i].join();
    }   

    auto end = std::chrono::high_resolution_clock::now();
    elapsed_seconds = end - start;
    std::cout << "Enqueue and Dequeue 1 million item in:" 
        << elapsed_seconds.count() << std::endl;   

    for ( int i = 0; i < NUM_ENQUEUE_THREAD; i++ )
    {
        PublishThread[i].join();
    }
             
    return 0;
}

Since it is running in duel core machine, it is not quite surprising that using two threads (one for enqueue, one for dequeue) give us the best performance. It only takes about 0.16 seconds to finish enqueue and dequeue one million data, including a binary search to locate a number between one and ten, under cygwin unix-windows operating system, using gcc 4.8 compiler, in a duel core Intel i7- 3520M@2.9GHz, 2.9GHz machine For Boost lock free queue, the best configuration is not one enqueue thread, one dequeue thread, it is one enqueue thread, two dequeue thread. It takes about 0.27 seconds to do the exact same actions.

History

  • 04/03/2014: First version
  • 04/03/2014: Add source code

Literature

  1. http://www.boost.org/doc/libs/1_55_0/doc/html/lockfree/examples.html
  2. http://www.codeproject.com/Articles/153898/Yet-another-implementation-of-a-lock-free-circular
  3. http://www.codeproject.com/Articles/23317/Lock-Free-Queue-implementation-in-C-and-C

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

PengHeProfessor
Software Developer (Senior)
United States United States
I have been developing low latency high throughput applications, services and platforms in financial trading industry since 2004, mostly in C++, some in C#.net and Java.

Comments and Discussions

 
-- There are no messages in this forum --
| Advertise | Privacy | Mobile
Web01 | 2.8.141022.2 | Last Updated 3 Apr 2014
Article Copyright 2014 by PengHeProfessor
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid