See more: C++ Threading
So I've been adding some multithreading to my game demo. It all works fine until I create more threads than I have cores on my CPU.
 
I've only got a dual core, so I have the main thread and one extra helper thread. If for whatever reason I create a third thread, performance drops considerably - like going from a supercar to your dad's first Ford.
 
With a single thread I get 50 fps, with two threads I get 70 fps but with three threads I get 8 fps.
 
Profiling with Very Sleepy shows that when I add the third thread, a lot of time is spent in WaitForSingleObject and ReleaseSemaphore. But I only use those in one place, and the locks get taken the same number of times regardless of the thread count, because there's only a limited amount of data.
 
I create the threads at startup using CreateThread and they then wait for an event to signal that there is work for them to do; once the work is done, they go back to waiting on the event.
 
I've not done much threading so am I doing something immensely stupid that's causing the horrific performance?
Posted 18-Oct-10 13:16pm

Solution 2

Having more threads than cores will slow you down if you are already running at 100% CPU. However, I have written programs that run dozens of threads on a dual-core CPU without a problem. You just need to avoid deadlocks and race conditions.
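For CPU-bound work like the asker's, a common sizing rule is to cap the worker count at the reported core count. A minimal sketch, assuming C++11 (the helper name is hypothetical):

```cpp
#include <thread>

// hardware_concurrency() may return 0 when the core count cannot be
// determined, so fall back to a safe default.
unsigned pick_thread_count() {
    unsigned cores = std::thread::hardware_concurrency();
    return cores != 0 ? cores : 2;  // e.g. 2 on the asker's dual-core CPU
}
```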

Solution 1

It's hard to diagnose the actual problem when no code example is given, but based on your information I arrive at the following:
Because you have more threads than cores in a situation with limited data, you have a lot of synchronization overhead that causes the processor to stall.
 
This is much clearer with a real-life example:
When doing dishes by hand, you have one person doing the washing and one doing the drying. When there is an extra person (thread) that has to compete for a lock on the only brush (and sink) available, it is clear that this gets messy without any actual performance gain. Doing dishes this way certainly won't get any faster.
 
But how to speed it up? Well, if after drying an item you would have to walk from the kitchen to the living room to put that item away, it could help to leave the drying cloth at the sink and let another person (thread) take that resource and use it while the other person is putting an item away (meaning the lock on the drying cloth is released).
 
The conclusion: only use more threads than cores when I/O operations are involved, because that waiting time is otherwise wasted. When threads merely compete for resources that are already available, extra threads won't speed anything up. In that case you drive the processor crazy, because threads are constantly suspended to and restored from main memory, meaning you are thrashing the cache. The latency cost is enormous, because a program's running time is ultimately the sum of the latencies the system incurs while executing it. By adding a thread you just added more latency-intensive operations, which drops performance drastically.
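To illustrate that I/O exception concretely, here is a rough sketch where a sleep stands in for a blocking I/O call (function and variable names are hypothetical). While a thread sleeps its core is free, so oversubscribing threads raises throughput in exactly this one case:

```cpp
#include <chrono>
#include <thread>
#include <vector>

using namespace std::chrono;

// Run `tasks` fake-I/O jobs across `nthreads` threads and return the
// wall-clock time taken. Each job just sleeps, standing in for a
// blocking read: the core sits idle, so more threads finish sooner.
milliseconds run_io_jobs(int tasks, int nthreads) {
    auto start = steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        pool.emplace_back([=] {
            for (int i = t; i < tasks; i += nthreads)
                std::this_thread::sleep_for(milliseconds(30));  // "I/O" wait
        });
    }
    for (auto& th : pool) th.join();
    return duration_cast<milliseconds>(steady_clock::now() - start);
}
```

With 8 such jobs, 2 threads take roughly four sleep periods while 8 threads take roughly one; for CPU-bound jobs (no sleep) the extra threads would buy nothing and cost context switches.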
 
Good luck!
Comments
SK Genius at 19-Oct-10 8:59am
Hmm, so adding more threads is kinda like using too much memory and causing paging, or setting the max resolution in a game with the highest AA when your card doesn't have enough RAM to handle it all.
 
I always figured that threads would be handled better somehow. But still, like a good developer I am limiting the number of threads created based on the number of cores available to run them.
E.F. Nijboer at 19-Oct-10 11:19am
It's not that the thread is using a huge amount of RAM. First, threads 1 and 2 are working and thread 3 is waiting for a core to become available. Thread 1 finishes and tries to acquire the lock. With one thread per core, the kernel doesn't interrupt the thread and it can resume after the lock is acquired. But with the extra thread the kernel switches context, meaning thread 3 is scheduled to execute. The processor cache is evicted and the code and data of thread 3 are loaded. The cache controller probably didn't (and couldn't) anticipate this and falls way behind on getting the necessary data to the processor. The processor is very inefficient at this point because the code and data needed to do actual work are simply missing. This happens for each thread on every iteration: each time, the thread's context is switched with the waiting thread's, and that takes a huge amount of time because the quite intelligent cache controller could not foresee this tragedy of cache thrashing.

Solution 4

Context switches are expensive. They are about 2 orders of magnitude slower than a method or function call ... thus threads can defeat expectations when used inappropriately.
 
As a single data point: on my Dell running Ubuntu 12.04 and C++, a pthread mutex/semaphore-enforced context switch benchmarks at 14 microseconds. In contrast, the same machine can call "time(0)" (to fetch the wall-clock time) in only 76 nanoseconds (184x faster).
 

Another way to read this:
 
- A context switch consumes (perhaps) 14,000 nanoseconds (14 us) to change from processing frame i to frame i+1, without processing either frame, just changing contexts.
 
- A method call might use (perhaps) 76 nanoseconds to change from frame i to i+1 (without processing either frame, and no context switch).
 
How much work might your code accomplish in those 13,924 nanoseconds?
 

 
Also note:
 
If a semaphore has already been unlocked (i.e. semgive) when another thread attempts to lock it (i.e. semtake), no context switch occurs; the locking task simply continues its work. Perhaps you can rearrange your sequencing so that part of the software queues up the work such that the 'helper task' keeps the semaphores unlocked. With two cooperating threads, I suspect you will still have two context switches: one to process all the frames to output, and another to 'arm' and signal all 'ready' semaphores. The goal is to avoid flip-flopping from one thread to the other for each frame. (And there is probably a simple way to accomplish that.)
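For anyone wanting to reproduce a number like the 14 us above, here is a hedged sketch of one way to measure it in portable C++11. It uses a condition-variable ping-pong between two threads (not the pthread semaphore benchmark quoted, so absolute figures will differ by machine and mechanism; only the order-of-magnitude gap versus a plain call matters). All names are hypothetical:

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <ctime>
#include <mutex>
#include <thread>

// Ping-pong between two threads through a condition variable: each
// iteration forces at least two context switches. Returns average
// nanoseconds per iteration.
static std::uint64_t ping_pong_ns(int iters) {
    std::mutex m;
    std::condition_variable cv;
    bool ping = true;
    auto start = std::chrono::steady_clock::now();
    std::thread other([&] {
        std::unique_lock<std::mutex> lk(m);
        for (int i = 0; i < iters; ++i) {
            cv.wait(lk, [&] { return !ping; });  // wait for main's turn-over
            ping = true;
            cv.notify_one();
        }
    });
    {
        std::unique_lock<std::mutex> lk(m);
        for (int i = 0; i < iters; ++i) {
            ping = false;
            cv.notify_one();                     // hand off to the other thread
            cv.wait(lk, [&] { return ping; });   // wait for it to hand back
        }
    }
    other.join();
    auto ns = std::chrono::steady_clock::now() - start;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(ns).count()
           / iters;
}

// Baseline: average nanoseconds for a plain time(0)-style call.
static std::uint64_t time_call_ns(int iters) {
    volatile std::time_t sink = 0;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) sink = std::time(nullptr);
    (void)sink;
    auto ns = std::chrono::steady_clock::now() - start;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(ns).count()
           / iters;
}
```

On typical hardware the ping-pong cost per iteration comes out orders of magnitude above the plain call, which is the whole point of the answer above.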
Comments
H.Brydon at 25-Aug-13 21:49pm
Interesting answer but why are you answering a 3 year old question?

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


