Another back-of-the-envelope estimate:
You probably know that a context switch is typically more than two orders of magnitude slower than a function (or method) invocation. Perhaps your efforts with the thread pool and the number of clients are reducing the number of context switches, but I suspect they won't be enough.
As a single data point: on my Dell, with a dual-core Athlon II, running Ubuntu 12.04, and reporting its best BogoMIPS as 5210.77 for each processor:
--> a single pthread mutex/semaphore-enforced context switch benchmarks at about 14 us.
i.e. 50,000 of them (50,000 * 14 us) add up to 700 ms, or 0.7 seconds, or 70% of a single core's bandwidth, and none of your code has run yet, none of your data has transferred yet.
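If you want to reproduce that kind of number yourself, the usual trick is two threads forcing each other off the CPU by ping-ponging on a pair of semaphores. The following is only a rough sketch of that idea, not the exact program behind my 14 us figure; compile with -pthread:

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define ROUND_TRIPS 100000

    static sem_t semA, semB;

    // Same idea as getSystemMicrosecond() mentioned later in this answer.
    static uint64_t now_us(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        return (uint64_t)ts.tv_sec * 1000000ULL + (uint64_t)ts.tv_nsec / 1000ULL;
    }

    // Thread B: wait for A's post, then immediately hand control back.
    static void *pong(void *arg)
    {
        for (int i = 0; i < ROUND_TRIPS; ++i) {
            sem_wait(&semA);
            sem_post(&semB);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        sem_init(&semA, 0, 0);
        sem_init(&semB, 0, 0);
        pthread_create(&t, NULL, pong, NULL);

        uint64_t start_us = now_us();
        for (int i = 0; i < ROUND_TRIPS; ++i) {
            sem_post(&semA);   // wake the other thread ...
            sem_wait(&semB);   // ... and block until it wakes us: ~2 switches per loop
        }
        uint64_t duration_us = now_us() - start_us;

        pthread_join(t, NULL);
        printf("%.2f us per context switch\n",
               (double)duration_us / (2.0 * ROUND_TRIPS));
        return 0;
    }

Note the result depends on whether both threads land on the same core; pinning them together with taskset gives the worst-case (forced-switch) number.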
While multiple cores can absorb some of this, you should consider rethinking how you use threads.
I suggest you temporarily scale back to a few clients (say 1 or 2), then measure the durations of the various sections of your server code. Find the bottlenecks; don't guess.
For instance, wouldn't commenting out beginsend() eliminate all data transmission? It would probably be better to measure the duration of a working beginsend() than to guess at what eliminating it, and all the events it triggers, really means.
I use the following for measurement; perhaps it might help:
    uint64_t start_us = getSystemMicrosecond();
    /* ... the section of server code being measured ... */
    uint64_t duration_us = getSystemMicrosecond() - start_us;
And uint64_t getSystemMicrosecond(void) is trivially built on the Linux call

    clock_gettime(CLOCK_REALTIME, &ts)
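A minimal sketch of that helper (using CLOCK_REALTIME as above; CLOCK_MONOTONIC is usually the better choice for interval timing because it is immune to wall-clock adjustments):

    #include <stdint.h>
    #include <time.h>

    // Current time in microseconds since the epoch.
    uint64_t getSystemMicrosecond(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);          // seconds + nanoseconds
        return (uint64_t)ts.tv_sec * 1000000ULL      // seconds -> microseconds
             + (uint64_t)ts.tv_nsec / 1000ULL;       // nanoseconds -> microseconds
    }

On older glibc versions you may need to link with -lrt for clock_gettime.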