Deadlock Caused by Boost shared_ptr

Eric Z (Jing)

Sep 27, 2016

CPOL


It was back in 2010 when, one day, our ATS reported a sudden system hang. When it happened, the CPU load of the real-time target was 100%. Worse, it was not always reproducible. We found that it was triggered by some new module configuration tests, and that it only showed up on single-core PXI systems (real-time PharLAP, x86), not on dual-core or quad-core RT targets. We failed to reproduce it with manual tests (we were just not that lucky). The project was already in a rather late phase (Beta), and people started to panic.

Before long, we found that the hang was caused by two threads that had stopped making progress. The high-priority thread was busy-waiting on a lock held by a low-priority thread, so it kept running forever and never let the lock owner run: a deadlock!

Here’s some background. On this real-time target, there are two threads. One is the low-priority config thread, which performs various configuration tasks (e.g., accepting settings from the host PC and applying them to the stack on the RT target). The other is the high-priority scan thread, which reads and writes I/O channels over the network. The hang happened when we tried to change a setting of a module (a hardware entity comprising physical I/Os) in the config thread while the scan thread was also running.

Here is the code, simplified to focus on the problem.

Basically, there is a plugIn class that contains a map (handle to collection) of all collections. You can think of a collection as a property blob of a physical module. Each time a module configuration is changed, the corresponding collection is removed and re-added. To prevent a race condition with the scan thread, both the read (by the scan thread) and the write (by the config thread) of a collection are protected by the mutex m_readWriteMutex.

// Called when there is a module configuration change
void Observer::ConfigureModule(int handle, ModuleConfig *config)
{
    /*
     * Create CollectionInfo from module configuration "config"
     */

    m_plugin.RemoveIOCollection(handle);
    m_plugin.AddIOCollection(handle, collectionInfo);
}

// Config thread, low priority
void plugIn::AddIOCollection(int handle, CollectionInfo *info)
{
    CriticalSection guard(m_configMutex);
    shared_ptr<IOCollection> col = m_collections.find(handle);

    /*
     * Lengthy stuff: work on col for validation, create new collection
     * from "info", registration, etc.
     */

    CriticalSection rwGuard(m_readWriteMutex); // distinct name: "guard" is already declared above
    m_collections[handle] = newCollection;
}

// Config thread, low priority
void plugIn::RemoveIOCollection(int handle)
{
    CriticalSection guard(m_configMutex);
    shared_ptr<IOCollection> col = m_collections.find(handle);

    /* ... */

    CriticalSection rwGuard(m_readWriteMutex); // distinct name: "guard" is already declared above
    m_collections.erase(handle);
}

// Scan thread, high priority
void plugIn::ReadInputs(int handle, long* data)
{
    CriticalSection guard(m_readWriteMutex);
    shared_ptr<IOCollection> col = m_collections.find(handle);
    if (col)
    {
        col->read(data);
    }
}

// Scan thread, high priority
void plugIn::WriteOutputs(int handle, long* data)
{
    CriticalSection guard(m_readWriteMutex);
    shared_ptr<IOCollection> col = m_collections.find(handle);
    if (col)
    {
        col->write(data);
    }
}
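The listing above treats m_collections.find(handle) as if it returned a shared_ptr directly. With a standard map, find returns an iterator, so a real implementation needs a small lookup helper. Here is a minimal sketch of what that might look like, using std::map, std::mutex and std::shared_ptr in place of the original types. The names PlugInSketch, Find, Add and Remove are hypothetical, and IOCollection is reduced to a stand-in; none of this is the original code.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for the article's IOCollection.
struct IOCollection {
    long value = 0;
    void read(long* data) const { *data = value; }
};

// Hypothetical, simplified stand-in for the article's plugIn class.
class PlugInSketch {
public:
    void Add(int handle, std::shared_ptr<IOCollection> col) {
        std::lock_guard<std::mutex> guard(m_readWriteMutex);
        m_collections[handle] = std::move(col);
    }

    void Remove(int handle) {
        std::lock_guard<std::mutex> guard(m_readWriteMutex);
        m_collections.erase(handle);
    }

    // Returns true if the handle was found and *data was filled in.
    bool ReadInputs(int handle, long* data) {
        std::lock_guard<std::mutex> guard(m_readWriteMutex);
        // Copy of the stored pointer: this is the step that touches
        // the reference counter.
        std::shared_ptr<IOCollection> col = Find(handle);
        if (!col) {
            return false;
        }
        col->read(data);
        return true;
    }

private:
    // Caller must hold m_readWriteMutex.
    // Returns an empty pointer when the handle is not in the map.
    std::shared_ptr<IOCollection> Find(int handle) const {
        std::map<int, std::shared_ptr<IOCollection> >::const_iterator it =
            m_collections.find(handle);
        if (it == m_collections.end()) {
            return std::shared_ptr<IOCollection>();
        }
        return it->second;
    }

    std::mutex m_readWriteMutex;
    std::map<int, std::shared_ptr<IOCollection> > m_collections;
};
```

Because Find returns the pointer by value, every lookup copies it, which matches the behavior the rest of the article analyzes.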

This code works most of the time, but occasionally hangs. We finally traced it down to the root cause: copying a boost::shared_ptr. On PharLAP, we used a snapshot version of Boost, which provides the following implementation of shared_ptr.

template<typename T> class shared_ptr
{
public:
    template<typename Y>
    shared_ptr(shared_ptr<Y> const &r): px(r.px), pn(r.pn) // never throws
    {
    }
private:
    T * px;                  // contained pointer
    detail::shared_count pn; // reference counter
};

class shared_count
{
private:
    counted_base * pi_;
public:
    shared_count(shared_count const &r): pi_(r.pi_) // nothrow
    {
        pi_->add_ref();
    }
};

class counted_base
{
private:
    mutable lightweight_mutex mtx_;
    long use_count_;

public:
    void add_ref()
    {
        lightweight_mutex::scoped_lock lock(mtx_);
        ++use_count_;
    }
};

class lightweight_mutex
{
private:
    long l_;

public:
    class scoped_lock
    {
    private:
        lightweight_mutex &m_;
    public:
        explicit scoped_lock(lightweight_mutex &m): m_(m)
        {
            while(winapi::InterlockedExchange(&m_.l_, 1))
            {
                winapi::Sleep(0);
            }
        }
    };
};

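To prevent a data race in a multi-threaded environment, this version of Boost protects “++use_count_” with a lock. Surprisingly, this is a spin lock. The config thread copies the shared_ptr in AddIOCollection and RemoveIOCollection while holding only m_configMutex, so the scan thread can preempt it (under a preemptive scheduler) in the middle of add_ref(), while it still owns the spin lock. When the scan thread then copies the same shared_ptr, it is “blocked” but keeps running forever, since the low-priority thread (the lock owner) never gets a chance to run and release the lock! Sleep(0) does not help here, because a high-priority thread that yields is rescheduled ahead of any ready lower-priority thread.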

Why is it not seen on an SMP system? Since there is another CPU core, the low-priority thread gets a chance to run and release the lock.
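The difference between one core and two can be illustrated with a toy model. The sketch below is not real scheduling code. It assumes a strict-priority scheduler in which the spinning high-priority task is always ready and always gets the first core, and in which the low-priority lock owner needs a fixed number of time slices to reach its unlock. SimResult and SimulateSpinLock are hypothetical names made up for this illustration.

```cpp
// Toy model only: one loop iteration stands for one time slice.
struct SimResult {
    bool highAcquired;  // did the high-priority task ever get the lock?
    int lowTicksRun;    // time slices the low-priority lock owner received
};

// Models a strict-priority preemptive scheduler with `cores` CPUs.
// The low-priority task already owns the spin lock and needs
// `ticksToRelease` time slices before it releases it. The high-priority
// task spins on the lock, so it is always ready and always occupies
// the first core.
SimResult SimulateSpinLock(int cores, int ticksToRelease, int maxTicks) {
    bool lockHeld = ticksToRelease > 0;
    int lowRemaining = ticksToRelease;
    SimResult result = {false, 0};

    for (int tick = 0; tick < maxTicks; ++tick) {
        // First core: the spinner tests the lock.
        if (!lockHeld) {
            result.highAcquired = true;
            break;
        }
        // A second core, if present, is free to run the lock owner.
        if (cores >= 2 && lowRemaining > 0) {
            ++result.lowTicksRun;
            --lowRemaining;
            if (lowRemaining == 0) {
                lockHeld = false;  // owner reached its unlock
            }
        }
    }
    return result;
}
```

In this model, SimulateSpinLock(1, 3, 1000) never lets the owner run, so the spinner never acquires the lock, while SimulateSpinLock(2, 3, 1000) lets the owner finish after three slices and the spinner succeeds on the next one.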

Everyone was happy then. We had two options at that time. One was to move to a different version of Boost that implements “++use_count_” with an atomic operation. The other was to simply change our code to take the shared_ptr by reference instead of copying it. Given the potential risks of switching Boost versions (e.g., one of our dependencies was also using this version with the busy-wait lock, and it needed to pass shared_ptr objects to us somewhere in the code), we took the second option as a quick workaround for that release.

// The same change applies to WriteOutputs
void plugIn::ReadInputs(int handle, long* data)
{
    CriticalSection guard(m_readWriteMutex);
    // Replace copy-by-value with const-ref
    const shared_ptr<IOCollection> &col = m_collections.find(handle);
    if (col)
    {
        col->read(data);
    }
}
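To see what the const reference changes, here is a minimal sketch using std::shared_ptr (an assumption; the article uses an old boost::shared_ptr snapshot). It only shows whether the counter is touched, not the spin lock itself, since std::shared_ptr typically updates its count with atomic operations. Collection, UseCountWithCopy and UseCountWithConstRef are hypothetical names. Note that the reference avoids the copy only if the lookup yields a reference to the stored pointer; if the lookup returned the shared_ptr by value, a temporary copy would still be made.

```cpp
#include <memory>

// Hypothetical stand-in for a collection object.
struct Collection {
    long value = 0;
};

// Use count observed while holding a copy of the stored pointer.
long UseCountWithCopy(const std::shared_ptr<Collection>& stored) {
    std::shared_ptr<Collection> col = stored;  // copy: increments the counter
    return col.use_count();
}

// Use count observed while holding only a const reference to it.
long UseCountWithConstRef(const std::shared_ptr<Collection>& stored) {
    const std::shared_ptr<Collection>& col = stored;  // no counter access
    return col.use_count();
}
```

With a single owner, the copy reports a use count of 2 for its lifetime, while the const reference leaves it at 1.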

This workaround is safer than it first looks. It completely bypasses shared_ptr’s protection against data races on its internal counter, because the scan thread no longer takes its own reference. But in this case, it’s OK, thanks to m_readWriteMutex. If the config thread runs first and deletes the collection, the read/write in the scan thread simply won’t find it. And the config thread cannot kick in in the middle, because of thread priority.
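For comparison, the first option (a counter updated with an atomic operation instead of a lock) can be sketched as follows. This is only an illustration using std::atomic from C++11, which was not available in the toolchain described here. AtomicCountedBase is a hypothetical name loosely modeled on the counted_base shown earlier, not Boost's actual implementation.

```cpp
#include <atomic>

// Hypothetical lock-free variant of the counted_base shown earlier.
class AtomicCountedBase {
public:
    AtomicCountedBase() : use_count_(1) {}

    // No lock to hold, so a preempted thread cannot block another one.
    void add_ref() {
        use_count_.fetch_add(1, std::memory_order_relaxed);
    }

    // Returns true when the last reference was dropped.
    bool release() {
        return use_count_.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }

    long use_count() const {
        return use_count_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<long> use_count_;
};
```

Since the increment is a single atomic step, there is no window in which a low-priority thread can be preempted while owning something the high-priority thread must wait for.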