The purpose of this article is to explore the IO Completion Port mechanism provided in Windows and compare it with the other mechanisms available to wait for an IO completion.
The IO Completion Port mechanism was introduced in Windows to support the development of scalable server applications. The ideal server application should be able to serve multiple clients without sacrificing overall performance. To write a scalable server application, the developer should be comfortable with two basic techniques: asynchronous I/O and threads. Multithreading in a server application must be implemented carefully, because overall performance can drop due to excessive context switching. The amount of context switching grows with the ratio of threads spawned to processors available. The advantages and disadvantages of threads should be kept in mind while designing a scalable server application. Frequent switching between threads can also put a lot of overhead on the Windows Memory Manager because of swapping.
Three years ago, I wrote my first server application, which spawned three threads per client. Most of the synchronization between the threads was done with the system call WaitForSingleObject, which blocks the calling thread until the event object passed to it is signaled. Excessive use of the waitXXX APIs can cause a lot of context switches and hence overhead in the server application. I had no choice, because the three threads connected to three databases through a third-party database library and fetched results synchronously, so there was no way to fit an asynchronous I/O mechanism into my server application.
Note: Always try to design an application so that it incurs the minimum number of context switches.
Asynchronous I/O is a mechanism through which a calling thread can submit an I/O request and return before the kernel has served it. The calling thread can use that time to do additional processing before it goes to sleep and waits for the submitted I/O request to complete. There are four ways in which the calling thread can learn the status and result of a submitted asynchronous I/O request.
- Event Object: An event object can be set in an OVERLAPPED structure, which is then passed to a function such as WriteFile called on an I/O object that has been opened with FILE_FLAG_OVERLAPPED. When the I/O request completes, the event object set in the OVERLAPPED structure is signaled. Any wait function can be used to synchronize with the I/O. This method gives greater flexibility when multiple threads need to perform I/O on the same file, because different threads can wait on the different event objects specified in the OVERLAPPED structure of each submitted I/O operation. In this scenario, the calling thread needs to wait by calling the waitXXX APIs to fetch the status of the submitted asynchronous I/O request.
- Using File Object: The calling thread can wait on the file handle to synchronize with the completion of submitted I/O operations. This method has a disadvantage: when an application starts two I/O operations, a thread waiting on the file handle has no way to determine which of them completed and set the file handle to the signaled state. In this case too, the calling thread needs to wait for the completion of the I/O request.
- GetOverlappedResult: The calling thread can synchronize itself with the completion of an overlapped I/O operation by calling the GetOverlappedResult function. This function behaves as a wait function if bWait is set to TRUE. It is always safe to use an event object in the OVERLAPPED structure passed to this function. If the hEvent member of the OVERLAPPED structure is NULL, the kernel uses the state of the hFile object to signal when the submitted I/O operation has completed.
- IO Completion Port: An IO Completion Port is a kernel object that can be associated with a number of I/O objects such as a file, socket, or pipe. When an asynchronous operation is started on an I/O object that has been associated with the IO Completion Port, the calling thread returns immediately. When the submitted I/O request completes, the kernel dispatches a completion packet to the queue associated with the IO Completion Port.
IO Completion Port in Depth
IO Completion Port was introduced in Windows to suit the needs of an architecture that fits server applications well. The server application should be able to serve multiple clients with a limited number of threads, which constitute the thread pool. The size of this pool depends on the number of processors. An IO Completion Port and asynchronous I/O operations can be combined to design a scalable server application. The CreateIoCompletionPort function associates an I/O object with the I/O completion port. When an I/O operation submitted on an associated I/O object completes, the kernel I/O system dispatches an IO Completion packet to the queue associated with the IO Completion Port. The pool of worker threads waits on the IO Completion Port by calling the GetQueuedCompletionStatus function.
The threads blocked on the IO Completion Port are unblocked on a LIFO basis. This suits the server application architecture because it reduces the number of context switches, which is a basic design goal for this type of application. Once a worker thread finishes processing an IO Completion packet, it calls GetQueuedCompletionStatus again to fetch and process the next packet in the queue. If no packet is available, the thread blocks and waits on the IO Completion Port for packets to be delivered. Here lies the beauty and the strength of an IO Completion Port: efficiency improves under load because there are fewer context switches. The higher the load on the server, the fewer the context switches, because a worker thread rarely blocks when its call to GetQueuedCompletionStatus always finds queued I/O Completion packets.
The kernel controls the number of runnable threads associated with the IO Completion Port so that it does not exceed the port’s concurrency limit. It does this by blocking a thread that has called GetQueuedCompletionStatus until the total number of runnable threads associated with the IO Completion Port is below the concurrency limit set when the port was created. When GetQueuedCompletionStatus returns, it only signifies that a submitted I/O operation has completed on some I/O object. How does a server application identify the I/O object on which the I/O has completed? This information is provided by the key parameter that GetQueuedCompletionStatus returns to the caller. The key is passed to the IO Completion Port when the I/O object is associated with it. The interpretation of the key is up to the server application, which can use it, for example, to identify the client info block in an array of blocks storing client information.
We will be looking into an example, which will explore an implementation of the I/O Completion Port mechanism.
In our example, we will copy a large source file into a destination file using an I/O Completion Port. This technique can help copy large files faster. Fig.1 explains the design of the application and what the example is trying to achieve.
The implementation is done in such a way that the number of threads in the thread pool depends on the size of the source file. Therefore, the larger the file, the more threads are spawned to carry out the write operations on the destination file.
The source code is largely self-explanatory, as it is commented to explain each step. A few important points in the source code are worth pointing out.
- The source file and destination file have been opened with the FILE_FLAG_NO_BUFFERING flag, which ensures that the system does not cache the data, so the read and write operations can proceed asynchronously. Data caching can cause asynchronous I/O calls to behave as synchronous calls.
- If a write operation extends past the current length of the file, the asynchronous write is treated as a synchronous call. To avoid this, the destination file is enlarged in advance so that its size is at least the size of the source file and a multiple of the page size.
In our example, the destination file will never be smaller than the source file, and will usually be larger, because these files are opened with the FILE_FLAG_NO_BUFFERING flag. The file pointer of such files can only be set to positions that are multiples of the sector size.
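The rounding described above can be sketched in portable C. The helper name round_up_to_multiple is an assumption made for illustration and does not come from the article's source. The alignment unit is passed in as a parameter because a real program should query it from the system rather than hard-code it.

```c
#include <stdint.h>

/* Rounds size up to the next multiple of unit (unit must be non-zero).
   A size that is already a multiple of unit is returned unchanged. */
uint64_t round_up_to_multiple(uint64_t size, uint64_t unit)
{
    uint64_t remainder = size % unit;
    return remainder == 0 ? size : size + (unit - remainder);
}
```

For example, with an assumed unit of 512 bytes, a 1000-byte source file would need a 1024-byte destination file, while a 1024-byte source file would need no padding.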
The pool of threads, in our case the writer threads, waits for packets on the IO Completion Port by calling GetQueuedCompletionStatus:
bSuccess = GetQueuedCompletionStatus(hIOCompPort,
Note: Here, overlappedComp is of type OVERLAPPED, a structure that contains information about an asynchronous I/O operation.
The writer threads should not interfere with each other, i.e., they should not write to the same position in the file, as that would corrupt it. The offset and the number of bytes to write are maintained in an array of BUFFER_DATA structures. Each thread has its own structure, so no locking is required, which gives better performance.
The reader threads will issue an asynchronous read call on the file and the number of threads spawned will be directly proportional to the size of the file. The code snippet for the reader thread is as follows:
lpBufferData[iReadAsynOperations].overLapped.Offset = fileReadPosition;
lpBufferData[iReadAsynOperations].overLapped.OffsetHigh = 0;
lpBufferData[iReadAsynOperations].overLapped.hEvent = NULL;
bSuccess = ReadFile(hSource,
Here, fileReadPosition is incremented by BUFFER_SIZE each time before spawning a new thread. BUFFER_SIZE represents the number of bytes that can be read in a single read operation.
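The offset arithmetic can be sketched in portable C. The snippet above sets OffsetHigh to 0, which is sufficient as long as every read position fits in 32 bits. The sketch below shows one way to split a 64-bit position into the two halves for larger files. The BUFFER_SIZE value of 65536, the CHUNK_OFFSET type, and the functions chunk_count and chunk_offset are assumptions for illustration; the article does not state its actual buffer size.

```c
#include <stdint.h>

#define BUFFER_SIZE 65536u      /* assumed chunk size, not taken from the article */

/* Hypothetical mirror of the two offset fields of OVERLAPPED. */
typedef struct {
    uint32_t offset_low;        /* corresponds to OVERLAPPED.Offset */
    uint32_t offset_high;       /* corresponds to OVERLAPPED.OffsetHigh */
} CHUNK_OFFSET;

/* Number of BUFFER_SIZE reads needed to cover file_size bytes. */
uint64_t chunk_count(uint64_t file_size)
{
    return (file_size + BUFFER_SIZE - 1) / BUFFER_SIZE;
}

/* Starting position of the chunk with the given index, split into 32-bit halves. */
CHUNK_OFFSET chunk_offset(uint64_t index)
{
    uint64_t position = index * BUFFER_SIZE;
    CHUNK_OFFSET result;
    result.offset_low  = (uint32_t)(position & 0xFFFFFFFFu);
    result.offset_high = (uint32_t)(position >> 32);
    return result;
}
```

Because every chunk starts at its own multiple of BUFFER_SIZE, no two reads or writes overlap, which is what allows each thread to work from its own structure without locking.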
The IO Completion Port and asynchronous sockets can be used to design an optimum solution for a scalable server application.