
Developing a Truly Scalable Winsock Server using IO Completion Ports

By Norm .net, 22 Sep 2001
 

Requirements

This article expects the reader to be familiar with C++, the Winsock API 2.0, MFC and multithreading.

Windows NT/2000 or later: Requires Windows NT 3.5 or later
Windows 95/98/Me: Unsupported

Motivation

This article attempts to deal with the thorny issue of using Completion Ports with Windows Sockets. It also addresses some concerns raised by readers of the previous article. Portions of the code have been reengineered, so it's worth downloading again if you haven't already done so.

I have recently been working on a project that required me to develop a high performance TCP/IP server, typically a server similar to a Web Server, where a large number of clients can connect and exchange data.

The initial design of my server used one thread per TCP/IP client. I initially thought this was a good solution until I read an article on high-load servers which suggested that the server could get into a state of "Thread Thrashing" as the threads wake to service their client connections, and that the operating system could run out of system resources. Another problem was the per-client event notification I was using: a single Winsock wait call (WSAWaitForMultipleEvents) is limited to 64 event handles. Whoops. The solution to the problem was to develop a server with I/O Completion Ports.

During my research into I/O Completion ports, I found very few articles and code samples on real world applications, especially demonstrating writing data back to a client. This prompted me into writing this article.

Design

Instead of creating 1 thread per client (hence 1000 threads for 1000 clients), we create a pool of Worker Threads to service our I/O events. I will discuss the Worker Threads later in the article.

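To begin using completion ports we need to create a Completion Port. When creating it you specify the number of concurrent threads: the maximum number of threads the operating system will allow to run at the same time while servicing the port. This is a concurrency limit, not to be confused with the Worker Threads themselves, which we create separately. See the function prototype below.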

HANDLE CreateIoCompletionPort(
    HANDLE FileHandle,               // handle to file
    HANDLE ExistingCompletionPort,   // handle to I/O completion port
    ULONG_PTR CompletionKey,         // completion key
    DWORD NumberOfConcurrentThreads  // number of threads to execute concurrently
);

Specifying zero for NumberOfConcurrentThreads allows as many concurrently running threads as there are CPUs on the system. You can change this value to experiment with performance, but for the purpose of this article and code we will use the default value of zero.

Once the Completion Port has been created, the next step is to associate all accepted sockets with the Completion Port. The call to do this is again CreateIoCompletionPort. This is somewhat confusing, and it's probably better to wrap it in a function like AssociateSocketWithCompletionPort to do the job for you. Here's what AssociateSocketWithCompletionPort looks like:

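BOOL CClientListener::AssociateSocketWithCompletionPort(SOCKET socket,
                                                        HANDLE hCompletionPort,
                                                        ULONG_PTR dwCompletionKey)
{
    // The key is pointer-sized (ULONG_PTR) so that a context pointer survives on 64-bit builds.
    // Passing an existing port associates the socket with it; the concurrency value is ignored here.
    HANDLE h = CreateIoCompletionPort((HANDLE) socket, hCompletionPort, dwCompletionKey, 0);
    // On success the call returns the handle of the existing port.
    return h == hCompletionPort;
}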

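You'll notice that AssociateSocketWithCompletionPort requires a Completion key. A Completion key is a pointer-sized value of your choosing that the Completion Port hands back with every completion for that socket. Here it refers to a per-client structure holding an OVERLAPPED structure plus any other data you want to associate with the completion port and socket. Examine the structure below: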

struct ClientContext
{
    OVERLAPPED m_Overlapped;
    LastClientIO m_LastClientIo;
    SOCKET m_Socket;

    // Store buffers
    CBuffer m_ReadBuffer;
    CBuffer m_WriteBuffer;

    // Input elements for Winsock
    WSABUF m_wsaInBuffer;
    BYTE m_byInBuffer[8192];

    // Output elements for Winsock
    WSABUF m_wsaOutBuffer;
    HANDLE m_hWriteComplete;

    // Message counts... purely for example purposes
    LONG m_nMsgIn;
    LONG m_nMsgOut;
};

The reason a ClientContext is associated with a socket and completion port is so that we can keep track of the socket when the I/O is dequeued in the Worker Threads.
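As a minimal sketch of that idea, the completion key can simply be the address of the per-client context, cast to a pointer-sized integer when the socket is associated and cast back when the completion is dequeued. The `Context` struct and the `toKey`/`fromKey` helpers are hypothetical stand-ins (they are not part of the article's source), and `std::uintptr_t` stands in for the Win32 `ULONG_PTR` so the snippet builds on any platform.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, trimmed-down stand-in for the article's ClientContext.
struct Context {
    int  socketId;  // placeholder for the SOCKET handle
    long msgIn;     // messages received so far
};

// Pack the context address into a pointer-sized key (ULONG_PTR on Windows).
inline std::uintptr_t toKey(Context* ctx) {
    return reinterpret_cast<std::uintptr_t>(ctx);
}

// Recover the context from the key handed back by the completion port.
inline Context* fromKey(std::uintptr_t key) {
    assert(key != 0 && "a zero key carries no context");
    return reinterpret_cast<Context*>(key);
}
```

The key has to be pointer-sized for this to work: a 32-bit DWORD would truncate the address on a 64-bit build.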

Now that the socket has been associated with the Completion Port, we can discuss the Worker Threads in detail.

We create the worker threads at the same time as we create the completion port. The worker thread handles are closed immediately after creation as they are not needed.

The worker threads now wait on GetQueuedCompletionStatus. When an I/O request has been serviced, its completion is queued in the Completion Port; the last Worker thread to issue a GetQueuedCompletionStatus is woken and the I/O can be processed. See GetQueuedCompletionStatus below; notice that it returns a Completion Key, with which we can keep track of our associated socket.

BOOL GetQueuedCompletionStatus(
    HANDLE CompletionPort,       // handle to completion port
    LPDWORD lpNumberOfBytes,     // bytes transferred
    PULONG_PTR lpCompletionKey,  // file completion key
    LPOVERLAPPED *lpOverlapped,  // receives the OVERLAPPED pointer of the completed I/O
    DWORD dwMilliseconds         // optional timeout value
);
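To make the return values concrete, here is a hedged sketch of how a worker loop might interpret one GetQueuedCompletionStatus call. The `Completion` enum, `classifyCompletion` and `WorkerThreadProc` are hypothetical names, not the article's code, and treating a posted packet with a null OVERLAPPED as a shutdown signal is an assumed convention. The classification part is portable; the loop itself only compiles on Windows.

```cpp
#include <cassert>

// Hypothetical classification of one GetQueuedCompletionStatus result.
enum class Completion { WaitFailed, IoFailed, Posted, PeerClosed, Data };

inline Completion classifyCompletion(bool ok, const void* overlapped, unsigned long bytes) {
    if (!ok) {
        // No OVERLAPPED: nothing was dequeued (timeout or the wait itself failed).
        // With an OVERLAPPED: a failed I/O operation was dequeued.
        return overlapped == nullptr ? Completion::WaitFailed : Completion::IoFailed;
    }
    if (overlapped == nullptr)
        return Completion::Posted;      // packet from PostQueuedCompletionStatus
    if (bytes == 0)
        return Completion::PeerClosed;  // zero-byte socket completion: connection closed
    return Completion::Data;
}

#ifdef _WIN32
#include <winsock2.h>
#include <windows.h>

// Sketch of a worker thread; the port handle is passed as the thread parameter.
DWORD WINAPI WorkerThreadProc(LPVOID param) {
    HANDLE port = static_cast<HANDLE>(param);
    for (;;) {
        DWORD bytes = 0;
        ULONG_PTR key = 0;            // identifies the client context
        LPOVERLAPPED ov = nullptr;
        BOOL ok = GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE);
        switch (classifyCompletion(ok != FALSE, ov, bytes)) {
        case Completion::WaitFailed:
        case Completion::Posted:      // assumed convention: treat as "shut down"
            return 0;
        case Completion::IoFailed:
        case Completion::PeerClosed:
            // Look up the client via `key` and release its resources here.
            break;
        case Completion::Data:
            // Hand `bytes` bytes to the read/write handler for this client.
            break;
        }
    }
}
#endif
```

Keeping the decision logic in a small pure function makes it possible to check the edge cases without a live socket.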

A rule of thumb is number of Worker threads = 2 * number of CPUs on the system. This is a heuristic value and is explained in detail by Jeffrey Richter in "Programming Server Side Applications for Windows 2000". I've included a dynamic thread pooling algorithm in the source code sample (it is not enabled in the example), and you can experiment with the following values (remember to adjust NumberOfConcurrentThreads accordingly).

m_nThreadPoolMin  // The minimum number of threads in the pool
m_nThreadPoolMax  // The maximum number of threads allowed in the pool
m_nCPULoThreshold // The CPU threshold at which unused threads can be removed from the Worker ThreadPool
m_nCPUHiThreshold // The CPU threshold at which a thread can be added to the Worker ThreadPool
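The sizing rule and the four settings can be sketched as follows. This is not the algorithm from the download: `PoolSettings`, `defaultWorkerCount` and `adjustPool` are hypothetical names, and the text does not spell out the comparison direction, so the sketch assumes a thread is added when CPU usage rises above the high threshold and removed when it falls below the low one.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical settings mirroring the four member variables listed above.
struct PoolSettings {
    unsigned threadPoolMin;
    unsigned threadPoolMax;
    unsigned cpuLoThreshold;  // percent CPU usage
    unsigned cpuHiThreshold;  // percent CPU usage
};

// Rule of thumb from the text: two worker threads per processor.
inline unsigned defaultWorkerCount(unsigned cpus = std::thread::hardware_concurrency()) {
    return 2 * std::max(1u, cpus);  // hardware_concurrency() may report 0
}

// One adjustment step; returns the new pool size (assumed policy, see above).
inline unsigned adjustPool(const PoolSettings& s, unsigned current, unsigned cpuPercent) {
    assert(s.threadPoolMin <= s.threadPoolMax);
    assert(s.cpuLoThreshold <= s.cpuHiThreshold);
    if (cpuPercent > s.cpuHiThreshold && current < s.threadPoolMax) return current + 1;
    if (cpuPercent < s.cpuLoThreshold && current > s.threadPoolMin) return current - 1;
    return current;
}
```

For example, `defaultWorkerCount(2)` gives 4 worker threads on a dual processor machine.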

Now that we have the process in place, it's time to show the Completion Port architecture in diagram form:

[Diagram: Completion Port architecture]

The worker threads must issue an I/O request, either by WSARecv or WSASend; they then wait on GetQueuedCompletionStatus for the I/O to complete. Once the I/O has completed, GetQueuedCompletionStatus returns and the data can be processed.

So on a dual processor box we could quite comfortably handle 2000+ clients (depending on data throughput, workload, etc.) with only 4 threads.

In my IOCP_Server example I have a class CListener, which accepts TCP/IP clients and associates them with a Completion Port. CListener also holds a list of ClientContexts (for stats/referencing).

I have created my own data protocol for incoming/outgoing data packets: a 4 byte (integer) header containing the size of the packet, followed by the actual packet, e.g. 0500HELLO. This protocol is used to exchange data to and from the client.
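A minimal sketch of building such a packet is shown below. The article does not state the byte order of the header or whether the size includes the header itself, so this sketch assumes a little-endian length that counts the payload only. `makePacket` and `readLength` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::size_t kHeaderSize = 4;  // 4 byte (integer) header

// Build one packet: length header (assumed little-endian) followed by the payload.
inline std::vector<std::uint8_t> makePacket(const std::string& payload) {
    assert(payload.size() <= 0xFFFFFFFFu);
    const std::uint32_t len = static_cast<std::uint32_t>(payload.size());
    std::vector<std::uint8_t> packet;
    packet.reserve(kHeaderSize + payload.size());
    for (std::size_t i = 0; i < kHeaderSize; ++i)
        packet.push_back(static_cast<std::uint8_t>((len >> (8 * i)) & 0xFF));
    packet.insert(packet.end(), payload.begin(), payload.end());
    return packet;
}

// Read the payload length back out of a packet's header.
inline std::uint32_t readLength(const std::vector<std::uint8_t>& packet) {
    assert(packet.size() >= kHeaderSize);
    std::uint32_t len = 0;
    for (std::size_t i = 0; i < kHeaderSize; ++i)
        len |= static_cast<std::uint32_t>(packet[i]) << (8 * i);
    return len;
}
```

Under these assumptions `makePacket("HELLO")` produces 9 bytes: the header bytes 05 00 00 00 followed by the five payload characters.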

The Project

Included in the project for completeness are a CBuffer class to hold incoming and outgoing data and a CCpuUsage class for the ThreadPool allocation/deallocation.

Our code includes a map to route the requests to function handlers, see below:

// Here we use the natural (well...) way to neatly handle each queued status
BEGIN_IO_MSG_MAP()
IO_MESSAGE_HANDLER(ClientIoInitializing, OnClientInitializing)
IO_MESSAGE_HANDLER(ClientIoRead, OnClientReading)
IO_MESSAGE_HANDLER(ClientIoWrite, OnClientWriting)
END_IO_MSG_MAP()

bool OnClientInitializing (ClientContext* pContext, DWORD dwSize = 0);
bool OnClientReading (ClientContext* pContext, DWORD dwSize = 0);
bool OnClientWriting (ClientContext* pContext, DWORD dwSize = 0);
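The macro definitions live in the download and are not reproduced in the article, so the following is only a guess at the shape of the routing they set up: a switch on the type of the last I/O that forwards to the matching handler. The `Dispatcher` class, the trimmed `ClientContext` and the handler bodies are hypothetical, and the meaning of the return value (true requests another read) is an assumption.

```cpp
#include <cassert>

// I/O types named in the message map above.
enum IOType { ClientIoInitializing, ClientIoRead, ClientIoWrite };

// Trimmed-down stand-in for the article's ClientContext.
struct ClientContext {
    long m_nMsgIn = 0;
    long m_nMsgOut = 0;
};

class Dispatcher {
public:
    // Route one dequeued completion to its handler.
    bool ProcessIOMessage(IOType type, ClientContext* pContext, unsigned long dwSize) {
        assert(pContext != nullptr);
        switch (type) {
        case ClientIoInitializing: return OnClientInitializing(pContext, dwSize);
        case ClientIoRead:         return OnClientReading(pContext, dwSize);
        case ClientIoWrite:        return OnClientWriting(pContext, dwSize);
        }
        return false;  // unknown I/O type
    }

private:
    // Placeholder handler bodies; assumed convention: true means "post another read".
    bool OnClientInitializing(ClientContext*, unsigned long) { return true; }
    bool OnClientReading(ClientContext* pContext, unsigned long) {
        ++pContext->m_nMsgIn;
        return true;
    }
    bool OnClientWriting(ClientContext* pContext, unsigned long) {
        ++pContext->m_nMsgOut;
        return false;
    }
};
```

A table-driven map keeps the worker loop free of per-operation logic; adding a new I/O type means adding one entry and one handler.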

Example Project

Well, the best thing to do is fire up the examples and play with them. There are plenty of comments throughout the code.

For example, set the client up so it sends 99999 "Test Item " messages: it takes around 3 seconds and the CPU usage hardly flinches. Wow.

The example MFC project contains the Server code, which displays clients connecting and being accepted, and any incoming data read from the I/O Completion Port. It also allows data to be sent to a specified connected client.

Also included is an MFC Client, which sends and receives data and has a flood option for sending the same string repeatedly.

This should be a good jumpstart for anybody wanting to create a high performance Client/Server application for Windows NT/2000.

The server listens on port 999. Please change this in the client and server programs if it conflicts with your system.

If you have any corrections, enhancements or suggestions, please don't hesitate to contact me.

Credits

Firstly I'd like to thank Ulf Hedlund for taking the time to fix some of the subtle problems with the code, and I'd also like to thank the many other readers who have sent in comments and suggestions.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Norm .net
Software Developer (Senior), Software Kinetics
United Kingdom





Comments and Discussions

Question: Streaming Fails Reason (zubair_ahmed, 18 Oct '06 - 5:53)
Thanx to the information provided by John M. Drescher, I have come to know that my data stream is getting corrupted by simultaneous read/write on a single socket, which can be attributed to a lack of thread synchronization.

Can someone please guide me on how to solve this problem (data stream corruption)? Currently I have modified the server to have just one pending read per client, thereby solving the out-of-order packets problem.
 
Thanx in Advance
 
Z.A

Answer: Re: Streaming Fails Reason [modified] (zubair_ahmed, 31 Oct '06 - 18:42)
This solution is tested when you have just one pending receive or your receive operations are completing in order issued.
 
In OnClientReading find and replace the following line.
if (nSize && pContext->m_ReadBuffer.GetBufferLen() >= nSize)
 
with this one.
 
if (nSize && (pContext->m_ReadBuffer.GetBufferLen()-sizeof(int) >= nSize))
 
This reads the buffer at the correct boundaries and stops data corruption when messages are being transferred very frequently in variable-size chunks.
 

 

-- modified at 1:00 Wednesday 1st November, 2006
 
Z.A
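To illustrate why the header size has to be subtracted in that check, here is a hedged sketch of pulling one complete message off the front of a receive buffer. It is not the code from the download: a `std::string` stands in for CBuffer, `tryExtractMessage` is a hypothetical helper, and the header is assumed to be a raw 4 byte integer in host byte order whose value counts the payload only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Try to extract one complete message from the front of `buffer`.
// Returns false (leaving the buffer untouched) until header and payload are all there.
inline bool tryExtractMessage(std::string& buffer, std::string& message) {
    const std::size_t headerSize = sizeof(std::uint32_t);
    if (buffer.size() < headerSize)
        return false;  // header not complete yet; also avoids unsigned underflow below
    std::uint32_t nSize = 0;
    std::memcpy(&nSize, buffer.data(), headerSize);  // assumed host byte order
    // Corrected check: only the bytes after the header count toward the payload.
    if (buffer.size() - headerSize < nSize)
        return false;
    message.assign(buffer, headerSize, nSize);
    buffer.erase(0, headerSize + nSize);
    return true;
}
```

With a 5 byte payload announced and only 3 payload bytes received, the buffer holds 7 bytes. Comparing the whole buffer length against 5 would pass too early, whereas the corrected comparison waits for the remaining 2 bytes.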

Question: Where is the new release? (onirps, 5 Oct '06 - 0:29)
Where can I download the new-release to fix the bugs???
 
Regards
General: Problems with stress testing this server (Daniel92009, 21 Jul '06 - 7:24)
Hi,
 
I downloaded and stress-tested this server.
 
I immediately ran into trouble with WSAENOBUFS errors. If one simply ignores the errors the server still sort of works, but it leaks the OVERLAPPEDPLUS structures associated with the failed WSARecv function call. I tried immediately freeing that structure... this resulted in the server crashing on completion. I did a kluge to put pointers in a large FIFO and free them when the FIFO was 90% full (and freeing the rest at exit). That stopped the memory leak and the crash.
 
I compared it to the Microsoft SDK example. The biggest difference seems to be that this server allocates an OVERLAPPEDPLUS structure for each operation. The Microsoft SDK sample uses one OVERLAPPEDPLUS structure (they use a different name for the structure) for each client and they re-use it for each operation for a client (which means they don't read and write at the same time for the same client). The Microsoft sample executes my stress test (send/echo/verify/send...) 3 times as fast as this server and it does not have the problem with the WSAENOBUFS error.
 
My guess is that the problem with this server that causes the WSAENOBUFS error is allocating the OVERLAPPEDPLUS object for each operation. Apparently in a situation with lots of packets coming and going this can result in a lot of OVERLAPPEDPLUS objects floating around and produce this error. Also, memory allocations are expensive. I'm guessing that the new and delete operations cause the 3X difference in performance in my stress test, but that could also be a result of re-doing operations that fail as a result of the WSAENOBUFS error.
 
-Daniel Hale
 

Answer: Re: Problems with stress testing this server (patricklavoie, 1 Dec '06 - 1:58)
Good morning,
 
The problem is not that the server allocates an OVERLAPPEDPLUS structure for each operation. The problem is that it does WSARecv after *each* operation. After the IOInitialize, it does a read (that makes sense). After an IORead it also waits for another read (which also makes sense), but after a write operation it also does a read, and that's where the problem is: it should not issue a new read because there is already one read waiting in the queue. So after a short period of time, you end up having way too many reads waiting in the queue.
 
-Pat
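The effect described in this reply can be sketched with a small counting model. It does not touch Winsock at all; `outstandingReads` is a hypothetical function that assumes each echo round trip consists of one read completion followed by one write completion.

```cpp
#include <cassert>

// Counts the receives still outstanding on one client after `cycles` echo
// round trips (one read completion followed by one write completion each).
inline int outstandingReads(int cycles, bool readAfterWrite) {
    assert(cycles >= 0);
    int pending = 1;                    // the read posted after initialization
    for (int i = 0; i < cycles; ++i) {
        --pending;                      // a read completes...
        ++pending;                      // ...and the read handler posts the next one
        if (readAfterWrite) ++pending;  // the bug: the write completion posts a read too
    }
    return pending;
}
```

With the extra read after each write, the number of pending receives grows by one per round trip (1001 after 1000 round trips in this model). Without it, exactly one receive stays outstanding per client.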
Answer: Re: Problems with stress testing this server (Daniel92009, 3 Dec '06 - 10:55)
Pat,
 
Thanks for contributing some more insight to this problem!
 
I wonder if it would be possible for you to post a modified version of this code which does not exhibit the WSAENOBUFS error.
 
It will be interesting to see how the code performs (compared to the SDK code) once this problem is fixed. Perhaps the slower performance is almost all due to the coding error and not due to repeated allocations of the OVERLAPPEDPLUS structure.
 
-Daniel
General: Re: Problems with stress testing this server (patricklavoie, 4 Dec '06 - 13:56)
The fix is pretty simple. I'll send it to the author so he can update the sources. In the meantime, here it is:
 
bool CIOCPServer::OnClientWriting(ClientContext* pContext, DWORD dwIoSize)
{
    ULONG ulFlags = MSG_PARTIAL;

    // Finished writing - tidy up
    pContext->m_WriteBuffer.Delete(dwIoSize);
    if (pContext->m_WriteBuffer.GetBufferLen() == 0)
    {
        pContext->m_WriteBuffer.ClearBuffer();
        // Write complete
        SetEvent(pContext->m_hWriteComplete);
>>>> REMOVE THE NEXT LINE (LINE NO 892).. IT'LL ISSUE ONE UNNECESSARY READ FOR EACH WRITE
        return true; // issue new read after this one
    }
    else
    {

Question: * * * New Release ????? * * * (CWater76, 20 Jul '06 - 2:46)
Where can I download the new-release ???
 
The link is not reachable,
 
> http://www.ormerod.demon.co.uk/patches.htm
 
HELP HELP HELP
 
Regards

Question: CBuffer ??? (CWater76, 18 Jul '06 - 5:24)
Where can I download the CBuffer?
 
The link is not reachable.
 
>I've posted a fix for CBuffer from my web site at:
 
>http://www.ormerod.demon.co.uk/patches.htm
 
>Regards
 
>Norm
 
Regards..
General: There are several memory leaks and sometimes it gets down with exception (Roman75, 17 Jul '06 - 3:14)
There are several memory leaks and sometimes it crashes with an exception. Please do not use it as a working sample. The OVERLAPPEDPLUS structure is not released after use.


Last Updated 23 Sep 2001
Article Copyright 2001 by Norm .net
Everything else Copyright © CodeProject, 1999-2013