Developing a Truly Scalable Winsock Server using IO Completion Ports

22 Sep 2001

Requirements

This article expects the reader to be familiar with C++, the Winsock 2 API, MFC and multithreading.

Windows NT/2000 or later: Requires Windows NT 3.5 or later
Windows 95/98/Me: Unsupported

Motivation

This article attempts to deal with the thorny issue of using Completion Ports with Windows Sockets. It also addresses some concerns raised by readers of the previous article. Portions of the code have been reengineered, so it's worth downloading again if you haven't already done so.

I have recently been working on a project that required me to develop a high-performance TCP/IP server, similar to a web server, where a large number of clients can connect and exchange data.

The initial design of my server used one thread per TCP/IP client. I thought this was a good solution until I read an article on high-load servers, which suggested that the server could get into a state of "thread thrashing" as the threads wake to service their client connections, and that the operating system could run out of system resources. Another problem was the per-client event notification I was using: Winsock can wait on at most 64 event handles in a single call (WSA_MAXIMUM_WAIT_EVENTS). Whoops. The solution was to develop a server with I/O Completion Ports.

During my research into I/O Completion Ports, I found very few articles and code samples covering real-world applications, especially ones demonstrating how to write data back to a client. This prompted me to write this article.

Design

Instead of creating one thread per client (hence 1000 threads for 1000 clients), we create a pool of worker threads to service our I/O events. I will discuss the Worker Threads in more detail later in the article.

To begin using completion ports we need to create a Completion Port. When creating it, you specify the number of concurrent threads: the maximum number of threads the operating system will allow to run at the same time while servicing the port (not to be confused with the number of Worker Threads, which you create yourself). See the function prototype below.

HANDLE CreateIoCompletionPort(
    HANDLE    FileHandle,                // handle to file
    HANDLE    ExistingCompletionPort,    // handle to I/O completion port
    ULONG_PTR CompletionKey,             // completion key
    DWORD     NumberOfConcurrentThreads  // number of threads to execute concurrently
);

Specifying zero for NumberOfConcurrentThreads allows as many threads to run concurrently as there are CPUs on the system. You can change this value to experiment with performance, but for the purposes of this article and code we will use the default value of zero.

Once the Completion Port has been created, the next step is to associate every accepted socket with it. Somewhat confusingly, the call that does this is CreateIoCompletionPort again, so it is probably better to wrap it in a function such as AssociateSocketWithCompletionPort. Here's what AssociateSocketWithCompletionPort looks like:

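BOOL CClientListener::AssociateSocketWithCompletionPort(SOCKET socket, 
                                                        HANDLE hCompletionPort, 
                                                        ULONG_PTR dwCompletionKey)
{
    // The key is pointer-sized (ULONG_PTR) so that a context pointer also fits on 64-bit Windows.
    // Given an existing port, CreateIoCompletionPort returns that same handle on success, NULL on failure.
    HANDLE h = CreateIoCompletionPort((HANDLE) socket, hCompletionPort, dwCompletionKey, 0);
    return h == hCompletionPort;
}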

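You'll notice that AssociateSocketWithCompletionPort requires a completion key. A completion key is a pointer-sized value of your choosing that the Completion Port hands back with every completed I/O on that socket. In this server it refers to a structure that holds an OVERLAPPED structure together with any other data you want to associate with the completion port and socket. Examine the structure below: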

struct ClientContext 
{
    OVERLAPPED   m_Overlapped;
    LastClientIO m_LastClientIo;
    SOCKET       m_Socket;

    // Store buffers
    CBuffer m_ReadBuffer;
    CBuffer m_WriteBuffer;

    // Input elements for Winsock
    WSABUF m_wsaInBuffer;
    BYTE   m_byInBuffer[8192];

    // Output elements for Winsock
    WSABUF m_wsaOutBuffer;
    HANDLE m_hWriteComplete;

    // Message counts... purely for example purposes
    LONG m_nMsgIn;
    LONG m_nMsgOut;
};

The reason a ClientContext is associated with a socket and completion port is so that we can keep track of the socket when the I/O is dequeued in the Worker Threads.
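
The round trip from context pointer to completion key and back can be sketched in portable C++. This is only an illustration under stated assumptions: std::uintptr_t stands in for ULONG_PTR, and KeyDemoContext is a hypothetical, cut-down stand-in for ClientContext so that the snippet compiles without the Winsock headers. Both helper functions are invented for this sketch and are not part of the article's source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, cut-down stand-in for ClientContext (no Winsock types).
struct KeyDemoContext {
    int  socketId;   // stands in for SOCKET m_Socket
    long msgIn;      // stands in for LONG m_nMsgIn
};

// What the listener side does when associating a socket:
// the context pointer becomes a pointer-sized integer key.
std::uintptr_t MakeCompletionKey(KeyDemoContext* ctx) {
    return reinterpret_cast<std::uintptr_t>(ctx);
}

// What a worker thread does after dequeuing a completion:
// the key is turned back into the context pointer.
KeyDemoContext* ContextFromKey(std::uintptr_t key) {
    return reinterpret_cast<KeyDemoContext*>(key);
}
```

The point of the sketch is that the key must be wide enough to hold a pointer; a 32-bit integer type would truncate it in a 64-bit build.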

Now that the socket has been attached/associated with the Completion Port, we can discuss the Worker Threads in detail.

We create the worker threads when the completion port is created; the worker thread handles are closed immediately after creation, as they are not needed.

The worker threads now wait on GetQueuedCompletionStatus. When an I/O request has been serviced, a completion packet is queued to the Completion Port; the last worker thread to have called GetQueuedCompletionStatus is woken, and the I/O can be processed. See GetQueuedCompletionStatus below, and notice that it returns a completion key, with which we can keep track of our associated socket.

BOOL GetQueuedCompletionStatus(
    HANDLE       CompletionPort,    // handle to completion port
    LPDWORD      lpNumberOfBytes,   // bytes transferred
    PULONG_PTR   lpCompletionKey,   // file completion key
    LPOVERLAPPED *lpOverlapped,     // receives the OVERLAPPED of the completed I/O
    DWORD        dwMilliseconds     // optional timeout value
);

A rule of thumb for the number of worker threads is 2 * the number of CPUs on the system. This is a heuristic value and is explained in detail by Jeffrey Richter in "Programming Server-Side Applications for Microsoft Windows 2000". The source code sample includes a dynamic thread pooling algorithm (it is not implemented in the example), and you can experiment with the following values (remember to adjust NumberOfConcurrentThreads accordingly).

m_nThreadPoolMin  // The minimum threads in the pool

m_nThreadPoolMax  // The maximum threads allowed in the pool

m_nCPULoThreshold // The CPU threshold when unused threads can be removed from the Worker ThreadPool

m_nCPUHiThreshold // The CPU threshold when a thread can be added to the Worker ThreadPool
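
To make the tunables concrete, here is a minimal sketch of how such a sizing decision might look. The article does not spell out the policy, so the rule used here is an assumption: add a worker when every worker is busy and CPU usage is still below the high threshold, and retire one when some workers are idle and usage is below the low threshold. PoolSettings, DefaultWorkerCount and NextWorkerCount are hypothetical names for this illustration, not the author's code.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical settings mirroring the four tunables listed above.
struct PoolSettings {
    int threadPoolMin;    // m_nThreadPoolMin
    int threadPoolMax;    // m_nThreadPoolMax
    int cpuLoThreshold;   // m_nCPULoThreshold, in percent
    int cpuHiThreshold;   // m_nCPUHiThreshold, in percent
};

// Rule of thumb from the text: two worker threads per CPU.
// A reported CPU count of zero is treated as one CPU.
int DefaultWorkerCount(unsigned cpuCount) {
    return static_cast<int>(std::max(1u, cpuCount) * 2);
}

// Returns the worker count the pool should move to, one thread at a time.
// Assumed policy: grow only when all workers are busy and there is CPU to spare;
// shrink only when workers are idle and the machine is quiet.
int NextWorkerCount(const PoolSettings& s, int current, int busy, int cpuPercent) {
    const bool allBusy = (busy >= current);
    if (allBusy && cpuPercent < s.cpuHiThreshold && current < s.threadPoolMax)
        return current + 1;
    if (!allBusy && cpuPercent < s.cpuLoThreshold && current > s.threadPoolMin)
        return current - 1;
    return current;
}
```

With a minimum of 2, a maximum of 8 and thresholds of 10 and 75 percent, a pool of four fully busy workers at 50 percent CPU would grow to five, while the same pool at 90 percent CPU would stay at four because extra threads would only add contention.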

Now that we have the process in place, it's time to show the Completion Port architecture in diagram form below:

The worker threads must issue an I/O request, either by WSARecv or WSASend; they then wait on GetQueuedCompletionStatus for the I/O to complete. Once the I/O has completed, GetQueuedCompletionStatus returns and the data can be processed.

So on a dual-processor box we could quite comfortably handle 2000+ clients (depending on data throughput, workload, etc.) with only 4 threads.

In my IOCP_Server example I have a class, CListener, which accepts TCP/IP clients and associates them with a Completion Port. CListener also holds a list of ClientContexts (for stats/referencing).

I have created my own data protocol for incoming and outgoing data packets: a 4-byte (integer) header containing the size of the packet, followed by the actual packet, e.g. 0500HELLO. This protocol is used to exchange data to and from the client.
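
A length-prefixed protocol like this can be sketched in a few lines of portable C++. The sketch assumes the header is stored little-endian, which is a guess the 0500HELLO example seems to suggest but the article does not confirm. EncodePacket and TryDecodePacket are hypothetical helpers written for this illustration; they are not the CBuffer code from the download. Because TCP delivers a byte stream, the decoder only extracts a packet once the whole header and payload have arrived.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Prepends a 4-byte length header to the payload.
// Little-endian byte order is an assumption of this sketch.
std::vector<std::uint8_t> EncodePacket(const std::string& payload) {
    const std::uint32_t n = static_cast<std::uint32_t>(payload.size());
    std::vector<std::uint8_t> out;
    out.reserve(4 + payload.size());
    for (int shift = 0; shift < 32; shift += 8)
        out.push_back(static_cast<std::uint8_t>((n >> shift) & 0xFF));
    out.insert(out.end(), payload.begin(), payload.end());
    return out;
}

// Removes one complete packet from the front of `stream` and stores its payload.
// Returns false, leaving `stream` untouched, if a whole packet has not arrived yet.
bool TryDecodePacket(std::vector<std::uint8_t>& stream, std::string& payload) {
    if (stream.size() < 4) return false;            // header incomplete
    std::uint32_t n = 0;
    for (int i = 0; i < 4; ++i)
        n |= static_cast<std::uint32_t>(stream[i]) << (8 * i);
    if (stream.size() - 4 < n) return false;        // payload incomplete
    payload.assign(stream.begin() + 4, stream.begin() + 4 + n);
    stream.erase(stream.begin(), stream.begin() + 4 + n);
    return true;
}
```

A read handler would append each received chunk to the per-client buffer and call the decoder in a loop until it returns false, since one read may carry several packets or only part of one.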

The Project

Included in the project for completeness are a CBuffer class to hold incoming and outgoing data and a CCpuUsage class for the thread pool allocation/deallocation.

Our code includes a map to route the requests to function handlers, see below:

// Here we use the natural (well...) way to neatly handle each queued status

BEGIN_IO_MSG_MAP()
IO_MESSAGE_HANDLER(ClientIoInitializing, OnClientInitializing)
IO_MESSAGE_HANDLER(ClientIoRead, OnClientReading)
IO_MESSAGE_HANDLER(ClientIoWrite, OnClientWriting)
END_IO_MSG_MAP()

bool OnClientInitializing (ClientContext* pContext, DWORD dwSize = 0);
bool OnClientReading (ClientContext* pContext, DWORD dwSize = 0);
bool OnClientWriting (ClientContext* pContext, DWORD dwSize = 0);
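
For readers who have not seen the macros, the following sketch shows roughly what a handler map of this kind could boil down to: a switch that routes each dequeued operation to its handler. It is not the expansion of the author's macros, which live in the download. TraceContext, DemoServer and ProcessIoMessage are hypothetical names, the enumerator values are assumed, and treating a zero-byte read as a failure is an assumption made only to give the handlers observable results.

```cpp
#include <cassert>
#include <string>

// Enumerator names come from the handler map above; their values are assumed.
enum LastClientIO { ClientIoInitializing, ClientIoRead, ClientIoWrite };

// Hypothetical, cut-down context: just enough state to observe the routing.
struct TraceContext {
    std::string trace;   // records which handlers ran, in order
};

class DemoServer {
public:
    // One case per IO_MESSAGE_HANDLER entry in the map.
    bool ProcessIoMessage(LastClientIO io, TraceContext* ctx, unsigned long size = 0) {
        switch (io) {
        case ClientIoInitializing: return OnClientInitializing(ctx, size);
        case ClientIoRead:         return OnClientReading(ctx, size);
        case ClientIoWrite:        return OnClientWriting(ctx, size);
        }
        return false;   // operation not in the map
    }

private:
    bool OnClientInitializing(TraceContext* c, unsigned long) { c->trace += 'I'; return true; }
    // Assumption: a read that completes with zero bytes is reported as a failure.
    bool OnClientReading(TraceContext* c, unsigned long n)    { c->trace += 'R'; return n > 0; }
    bool OnClientWriting(TraceContext* c, unsigned long)      { c->trace += 'W'; return true; }
};
```

A worker thread would call ProcessIoMessage with the operation type it recovered from the completed I/O, the client's context, and the byte count reported by GetQueuedCompletionStatus.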

Example Project

Well, the best thing to do is fire up the examples and play with them. There are plenty of comments littered throughout the code.

For example, set the client up so it sends 99999 "Test Item " messages: it takes around 3 seconds and the CPU usage hardly flinches. Wow.

The example MFC project contains the server code, which displays clients connecting and being accepted, along with any incoming data read from the I/O port. It also allows data to be sent to a specified connected client.

Also included is an MFC client, which sends and receives data and has a flood option for sending the same string repeatedly.

This should be a good jumpstart for anybody wanting to create a high-performance client/server application for Windows NT/2000.

The server listens on port 999; please change this in the client and server programs if it conflicts with your system.

If you have any corrections, enhancements or suggestions, please don't hesitate to contact me.

Credits

Firstly, I'd like to thank Ulf Hedlund for taking the time to fix some of the subtle problems with the code, and I'd also like to thank the many other readers who have sent in comments and suggestions.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
