There are several tasks in game development tools programming which require large computational power. For example, it takes up to 24 hours to compute game level lighting in our level editor. Obviously, the longer the operation, the less frequently it will be used and we will have to settle for results with less quality, because a lot of iterations are required for good tweaking.
Recently, I gave a lot of attention to GPGPU – using Graphics Processing Unit (GPU) power to accelerate our tools. Early versions of our level editor did lighting calculations (Ambient occlusion + direct lighting) by rendering hemicube side views from every texel of the lightmap. Unfortunately, experiments show that a simple CPU-only raytracer can do this task 2-3 times faster, because of a slow Videomemory to CPU transfer rate. Possibly, I could have achieved better results, if the whole algorithm was implemented on the GPU, but this would have required a lot of work due to the complex GPGPU programming model. Only recently, nVidia has provided a relatively simple C framework for GPGPU programming. Multicore CPUs have shown good perspective it this area. The second core, unlike a GPU, can execute the same binary code. There is no need to adapt algorithms for different programming models and languages to take advantage of additional performance.
Before I even started to optimize tools for multicore CPUs, I understood that I needed to think wider. :)
Before we continue, I will ask the reader to press CTRL+ALT+DEL, run the Task Manager and switch to the Performance tab. Take a look at the "CPU Usage" graph. I bet most readers will see values less then 10% there. On dual-core CPUs, the second core will be free most of the time.
During typical work, full computer power is used only in short bursts while generating responses to the actions of the user. When the user is typing text in Microsoft WordTM, pauses between keyboard events are comparable to eternity from the processor’s viewpoint. Indeed, the number of applications that fully utilize the available processing power is not very high.
Now count up The power that is available in a medium sized office (50+ workstations) and you will get a value that is comparable to modern supercomputers.
Unfortunately, the traditional relationship of cluster calculations to high science served as a mental barrier for me to attempt to use distributed computing in everyday tasks. Fortunately an excellent application, IncredyBuild has shown that distributed computing can be used not only for everyday, but also possibly for real-time tasks.
Data-parallel programming model
A number of workstations, connected by TCP/IP network, can be considered as a supercomputer with a distributed memory architecture. The most simple pattern for these systems is data-parallel computing.
In other words, if some ordinary application processes an array of independent elements, then in a data-parallel model, each processor is assigned to process some part of array. To support data-parallel computing, the core library should divide tasks into parts, transfer task data to the local memory of a particular CPU, run the task on that CPU, transfer results back to the caller, and provide the ability to request some global data from the caller.
This work is done by cluster support software. I have searched for available software, taking into account the following requirements:
- Workstations are running Windows XP;
- Only idle time is allowed for usage by cluster. The workstation user should not experience any slowdowns;
- High tolerance – the cluster is not dedicated. Any workstation can be rebooted, go offline or get busy by a user, but this should not affect the overall cluster stability and state.
The first requirement excluded most of the candidates. The reason is clear – why would someone buy a commercial OS for every cluster workstation, if there is freeware Linux?
The second and third requirements excluded all the other candidates, although I would like to mention some of the most promising libraries.
. This is a Windows implementation of the MPI library. Unfortunately, it runs hidden processes with normal priority, which affects the workstation user. If a workstation goes offline, the MPI session terminates with an error.
. Despite an easy programming model, I have put aside this candidate because there have been no new versions of this library for a long time.
. This is the most promising candidate. It provides an easy to use programming model, works on .NET framework, which in theory can allow even a Windows Mobile smartphone, like the Motorola MPX 200, to became a node in the cluster. It contains wonderful examples and continues to evolve. Unfortunately, during my testing, some examples crashed and left orphan processes on nodes.
So after my unsuccessful search, I made the decision to develop a library myself.
During development, I took into consideration the following additional requirements:
- TCP/IP 10-1000 MBit local network is used for data transfer;
- a network is considered internal and safe. i.e. the library does not contain any authentication protocols. A good security system would increase development time dramatically and make library usage more complex;
- "time to send task" / "time to complete task" ratio is 0.1 or more. Typical task data size can be 0.001-100MB;
- there is some workstation on the local network which is always online and can serve as an application coordinator (manager);
- the library will be used both for offline (up to several hours) and semi-real-time (up to several seconds) tasks;
- a single workstation can run only one grid application in any given period of time (this is a limitation of current version);
- for debugging purposes, a single workstation should be able to run agent, coordinator and grid applications at the same time;
- when a workstation goes online, it should be able to join already running grid sessions;
- when a workstation goes offline, this should not break overal process.
hxGrid just sends an incomplete task to other node;
- the cluster can be used from several workstations at once. The computational resources should be redistributed equally;
- the grid agent should be able to use all CPU cores on each workstation;
- the library should be available for use in C++ and Delphi.
To minimize development time, I chose to implement the library in Delphi. The library uses fake COM interfaces, the Jedy Visual Code Library and ZLIB.
The binary package should be downloaded to install a cluster.
The cluster software consists of three components:
Coordinator should be installed on a workstation that is always online.
Coordinator is responsible for maintaining a list of computational resources and distributing the list to grid users.
Coordinator should be installed on only one workstation in local network.
Agents and Users can find
Coordinator by sending broadcast packets on the local network, so that no configuration is required if
Coordinator is moved to a PC with a different IP address.
Users connect to
Coordinator only to get a list of agents. Later, users connect to agents directly, unlike other grid solutions where the coordinator distributes tasks among agents. This allows the fastest communication between user and agents.
Agents should be installed on every workstation in the local network.
Agent allows the idle CPU time of a workstation to be used.
User. A Grid User is some application written with hxGrid library.
Programming for hxGrid
hxGrid allows tasks (procedures) to be run on cluster workstations. A grid application places tasks into a queue and waits for their completion. The input data for tasks are written to a stream (
IGenericStream inStream). After successful execution of the task, output data is also received from a stream (
IGenericStream outStream). If necessary, an agent executing a task can request additional data from the grid application.
Every task is function with the following signature:
typedef bool (__cdecl TTaskProc)(IAgent* agent, DWORD sessionId,
IGenericStream* inStream, IGenericStream* outStream);
type TTaskProc = function(agent: IAgent; sessionId: DWORD;
inStream: IGenericStream; outStream: IGenericStream): boolean; cdecl;
is a unique session id. This Id is required to request additional data from a grid application (described later). The function should return
if interrupted (described later).
Such procedures should be placed into a DLL.
hxGrid delivers the code by transferring the DLL to the other workstation. The
TTaskProc function should also be thread-safe. After successful execution of the task by the agent, the library sends the output stream back to the user and calls
typedef void (__cdecl TFinalizeProc)(IGenericStream* outStream);
type TFinalizeProc = procedure(outStream: IGenericStream); cdecl;
This function should also be thread-safe.
If a single threaded application is processing a large array of independent elements, the obvious way to increase performance is to divide the array into small ranges and process each of them on cluster nodes.
How it works
When started, each agent looks for the coordinator, sending broadcast packets over the LAN. When a connection with coordinator has been established, the agent starts sending its status periodically (percentage of free CPU time, amount of free physical memory, number of queued tasks).
To begin cluster usage, the grid application initializes the
hxGrid library. The library connects to the coordinator (by sending broadcast packets, if required), requests the list of agents and connects to them directly. From this moment, the library is ready to execute remote tasks. At the same time, the library continues to request the list of agents, so new agents are able to join running sessions.
The grid application then adds tasks to the queue. The library continuously sends tasks to agents in a background thread. If a connection with some agent is lost, the library is able to reassign its tasks to other agent.
When the library is deinitialized, it disconnects from the agents.
To increase efficiency, the following technical decisions were applied:
- To compensate for the time lag of network communications, the library sends the next task to each agent earlier;
- Under some circumstances, the library can send a copy of a task already running on an agent to other free agents. This can help complete the job faster if an agent is slow or has no free CPU time at the moment;
- The library tracks the amount of free CPU time and free physical memory on the agents. The library will not send a task to an agent if it does not have enough free CPU time (percentage is specified in settings) or has less free physical memory than required to store the input stream (factor is configured in settings);
- The library controls the length of the task queue and size of the input streams. If the overall size of queued input streams exceeds a configured value, the library blocks application execution until at least one task is completed;
- If a special option is selected, the library can swap the task queue to disk to minimize memory usage;
- Before sending a large stream over network, it is compressed with ZLIB (size threshold is specified in the settings);
- The agent caches DLLs locally during the session; DLLs are not resent to execute same task again;
- If an agent does not have free CPU time for 10 seconds, this means that user has launched a resource intensice application. In this case, the agent suspends all working threads for 25 seconds. This is necessary, because otherwise the Windows scheduler will dedicate 120ms of CPU time to working threads with IDLE priority when CPU starvation is detected. If the threads would not have been suspended, this could lead to jerky FPS in games, for example.
PlayStation 2 cluster.
The library is distributed as two DLL's: hxGridUserDLL.dll and
zlib.dll. To start a session, the grid application should create the
IGridUser object (see examples\picalculator\C++ and examples\picalculator\Deplhi):
IGridUser* user = CreateGridUserObject(IGridUser::VERSION);
Uses T_GenericStream, I_GridUser;
From this moment, it is possible to execute tasks on the cluster nodes:
IGenericStream* stream = CreateGenericStream();
stream := TGenericStream.Create();
d := 1+i*9;
RunTask() method has the following parameters:
- dll filename with function code;
- symbolic name of function (function should be exported from DLL by name);
- input stream. Library is receiving ownership of the stream object;
- completion callback address;
- address of variable to receive unique id of task;
- blocking flag.
If the task can not be added to queue immediately (due to limitations of queue length or queue input streams size) and the blocking flag is set, the method will not return until the task is added to queue. Otherwise the method will return S_FALSE and the application can wait for an opportunity with
Please note that some dependent DLLs, for example VC++ runtime libraries, might be missing on a remote workstation. Buiding with static libraries is advised and always check dependencies with the
dumpbin utility. To send additional DLLs to a workstation, their names should be specified in the
RunTask() method, separated with commas:
The Task function should call the following agent method periodically, to allow the agent handle abortion and suspension events:
if (agent.TestConnection(sessionId)<>S_OK) then
result := false;
This is a very fast method and can be called from an inner cycle even with ~10000Hz frequency.
User->WaitForCompletion() is used to wait until all queued tasks are done.
To close a session, the application should destroy the
This short section has shown everything needed to use
Requesting additional data from user
In some circumstances tasks running on agents, must access some global data common to all tasks in the session. If the global data size is high, it is inefficient to pass global data in the task input stream. The agent provides the following method to request global data from the user:
virtual HRESULT __stdcall GetData(DWORD sessionId, const char* dataDesc,
IGenericStream** stream) = 0;
function GetData(sessionId: DWORD; dataDesc: pchar;
var stream:IGenericStream): HRESULT; stdcall;
dataDesc – symbolic identifier of data, for example "geometry";
stream – address of variable to receive address of stream with data (stream is created by agent, and should be freed by the task).
To handle requests, an
application should register the following callback:
typedef void (__cdecl TGetDataProc)(const char* dataDesc, IGenericStream** outStream);
type TGetDataProc = procedure(dataDesc: pchar; var stream: IGenericStream); cdecl;
User->BindGetDataCallback(callback: TGetDataProc); stdcall;
TGetDataproc should create the stream and fill it with data. Ownership of the stream is passed to the library. Usually a task is trying to access some global cache and if data is not there, requests data from the
hxGrid application. The cache should be indexed with
sessionId and a symbolic description of the data. These methods are used in examples\normalmapper-3-2-2 to request a hi-poly mesh octree.
PlayStation 2 cluster.
virtual void __stdcall GetSettings(TGridUserSettings settings);
virtual void __stdcall SetSettings(TGridUserSettings settings);
User.GetSettings(var settings: TGridUserSettings); stdcall;
User.SetSettings(var settings: TGridUserSettings); stdcall;
These methods allow the settings to be changed programmatically. For example, an application can disable stream compression if it fills streams with compressed data.
virtual HRESULT __stdcall CompressStream(IGenericStream* stream);
User.CompressStream(stream: IGenericStream): HRESULT; stdcall;
This method allows stream compression. For example, it is good practice to prepare and compress streams with global data ahead of time, so the library will not have to compress the stream each time
is called. The library is able to find out if a stream is compressed by examining its signature. Compressed streams should not be unpacked by the task, since the agent does it automatically. See examples\normalmapeer-3-2-2
virtual HRESULT __stdcall FreeCachedData(DWORD sessionId, const char* dataDesc);
HRESULT __stdcall IAgent::FreeCachedData(DWORD sessionId, const char* dataDesc) = 0;
The agent is caching global data requested during session. This method allows freeing the cache to minimize memory usage, if it is known that data will not be requested again during the session. For example, if a task requests global data, builds some structures from it, and stores them in some custom global cache, then other tasks can use structures from this cache and not call
during again during the session.
typedef void (__cdecl EndSession)(IAgent* agent, DWORD sessionId);
type TEndSessionProc = procedure(agent: IAgent; sessionId: DWORD); cdecl;
The agent calls the
callback at the session's end. Usually this callback is used to free the global session data cache. The callback method should be exported by name from the DLL which was specified in
. This callback is optional.
There are two patterns for using the library.
- Tasks are added to the queue with
IGridUser->RunTask(blocking = true), then the application waits for completion with
- Tasks are added to the queue with
IGridUser->RunTask(blocking = false). If method returns
S_FALSE because the queue is full, the application waits for the queue to progress with
IGridUser->WaitForCompletionEvent(). When all tasks have been submitted, the application waits for tasks completion by calling
for s:=1 to 200 do
if state=ST_CANCELING then
if state=ST_CANCELING then break;
while (bl) do
if (state=ST_CANCELING) then
An application can cancel all tasks at any time by calling
Despite being more complex, the second pattern has the following advantages:
- the application can perform some tasks while waiting,
- there is the ability to cancel tasks, and
- the ability to show progress in the UI.
For details of the second pattern, see examples\GridGMP\.
Examples of library usage
PI Calculator (examples\PICalculator)
This is the simplest example and is a
Pi value calculation with specified accuracy. Each task is calculating
A Mandelbrot fractal explorer. There are three versions of the application:
– this example uses the FPU native 10-byte floating point type and runs on a single CPU. Lack of accuracy does not allow high zoom levels.
– this example uses 256-byte
on single CPU. A Delphi port of the GMP library is used.
– the example uses 256-byte
. Each agent is assigned to calculate one vertical line of the image. This application also shows how badly partitioned calculations can decrease overall performance: vertical lines can require very different amount of processing power, up to 50:1 ratio. A better solution would be to assign each agent to calculate every N’s pixels of image’s horizontal scanning.
3.2.2, ported to
Normalmapper is an ideal application for distributed computing. Each texel of a normal map can be calculated independently from the others. Total time to calculate a normal map with an occlusion term and bent normals enabled, can take up to 12 hours. The modified version forms tasks to calculate 100 pixels of a normal map. Each agent requests a serialized hi-poly mesh octree by calling
. Octree size can be up to 300Mb, so sending it in the task input stream is inefficient. In addition, the modified version contains a mutex, which allows running only one instance of
at a time. It is possible to run several instances with different models and they will run in queue automatically.
It should be mentioned that the library does not guantantee the order of task completion. Since it is important to calculate texels in a strict order because of intersecting mapping and border texels, the application perform special procedures to update normal map texels in the required order.
How to install ATI
A couple of PCs running Windows 2000/XP SP1/SP2/Vista, connected via local network.
hxGrid Coordinator and install on any workstation in the local network.
Coordinator should be installed only on one workstation in the local network. This PC should always be available to coordinate
hxGrid Agent and install on all workstations in the local network. The more PCs that are running the agent, the faster the
hxGrid apps will work.
- Download ATI
hxGrid and install on any PC in the local network.
No additional configuration is required.
Hint: If everything above is installed on single a workstation and it does not have network adapter with IP address, the "Microsoft loopback adapter" can be installed and assigned an IP address.
- Intel Pentium D Presler 3.0GHz
- Intel Pentium 4 2.8 GHz (HT)
- Intel Core 2 Duo 1.8GHz
- Intel Core 2 Duo 2.13GHz
- Athlon X2 3600+
- LG GE notebook (Intel Celeron M 1.4 GHz)
Measurements were made for calculation of car_low.nmf + car_high.nmf meshes, 4096 x 4096, Bent normals, Ambient occlusion.
Normalmapper.exe –on carlow.nmf carhigh.nmf 4096 4096 test.tga
For experiments, the original version of ATI
Normalmapper was recompiled with VS 2005 with aggressive optimizations (no PGO).
Experiments show that
hxGrid can be successfully used to take advantage of multicore CPUs even on a single workstation (agent, coordinator and grid application is running on single workstation).
Normalmap calculation, hours
Performance improvement, times
In other words, calculation of normalmap on the cluster took 8 minutes, comparable to almost 4 hours on a single PC (performance improvement = 27.4 times). The HT CPU showed an expected 12% speed improvement, but I can’t explain the 35% speed up of Core 2 Duo. (I expected ~80%). Probably, the bottleneck here is the shared L2 cache, since the Pentium D Presler showed a 73% speedup.
State of cluster (end of working day)
Because of high energy consumption, the cluster room should be equipped with ventilation and cooling. How much does the energy consumption of the office increase while agents are active?
Energy consumption of single node, Watt
100% processor load increases energy consumption approximately by 30 Watt. Accuracy of my measurements can be questionable, but even in an office with 50 workstations, the overall energy consumption increases only by 1.5KW, which is less than a good teapot (1.8KW).
The following problems have not been solved yet:
- disconnection from grid can take up to 30 seconds, due to an iniability to close a connection on the server side ("features" of
TServerSocket Delphi component). This is not a problem when using Distributed
Normalmapper, but can become a problem when
hxGrid is used to solve semi-realtime tasks. If someone can help me with a solution, I would appreciate it.
EndSession() callbacks are not yet implemented;
- There is no easy error handling and debugging support.
Public applications using
xNormal is using
hxGrid to accelerate normalmap generation.
- 1. General-Purpose Computation Using Graphics Hardware, http://www.gpgpu.org/
- 2. Ambient occlusion - From Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Ambient_occlusion
- 3. NVIDIA CUDA Homepage, http://developer.nvidia.com/object/cuda.html
- 4. Incredibuild by Xoreax software, http://www.xoreax.com/main.htm
- 5. Globus Toolkit Homepage, http://www.globus.org/toolkit/
- 6. Кластерная система Condor, http://www.osp.ru/os/2000/07-08/178077/
- 7. MPICH2, http://www-unix.mcs.anl.gov/mpi/mpich/
- 8. libGlass – distributed computing library, http://libglass.sourceforge.net/download.php
- 9. Alchemy – distributed computing library, http://www.alchemi.net/index.html
- 10. hxGrid binaries, http://sourceforge.net/project/showfiles.php?group_id=236427
- 11. Программирование с использованием COM-подобных интерфейсов, http://www.dtf.ru/articles/read.php?id=44995
- 12. То, что вам никто не говорил о многозадачности в Windows, http://www.dtf.ru/articles/read.php?id=39888
- 13. The GNU MP Bignum Library, http://gmplib.org/
- 14. ATI Normalmapper, http://ati.amd.com/developer/tools.html
- 15. Jedy Visual Code Library, http://homepages.borland.com/jedi/jvcl/
- 16. PlayStation 2: Computational Cluster, http://arrakis.ncsa.uiuc.edu/ps2/cluster.php
- 17. Unmodified Xbox Cluster, http://www2.cs.uh.edu/~bguillot/xbox/home.html
- 18. Информационно-аналитический центр parallel.ru, http://www.parallel.ru/
- 19. Распределенные вычисления: поиск лекарства от рака, http://www.3dnews.ru/reviews/software/cure-for-cancer/
- 20. РАСПРЕДЕЛЕННЫЕ МОЗГИ, http://www.fuga.ru/articles/2003/01/distributed.htm
- 21. Взлом NTV+ с помощью распределенных вычислений, http://www.xakep.ru/post/20600/default.asp
- 22. Знаете ли вы, что большинство времени ресурсы компьютера используются менее чем на 5%?, http://distributed.ru/?what-is
- 23. Распределенные вычисления с минимальными затратами, http://msk.nestor.minsk.by/kg/2002/07/kg20708.html
- 24. Распределенные вычисления на FreePascal под Windows, http://freepascal.ru/article//raznoe/20051207110629/
- 25. Распределенные вычисления - паразитные вычисления, http://center.fio.ru/method/resources/judina/05-03/news/parazit.htm
- 26. Sony рассматривает возможность продавать процессорное время PS3, http://www.gamasutra.com/php-bin/news_index.php?story=13476
- 27. ZLIB, http://www.zlib.org
- 28. NVIDIA Texture Tools 2 Alpha, http://developer.nvidia.com/object/texture_tools.html
- 29. hxGrid source code, http://sourceforge.net/projects/hxgrid/
- 30. xNormal, http://www.xnormal.net/
- 31. Author’s page, http:/www.deep-shadows.com/hax/
- Version 1.09d - fixed critical bug: unable to complete last few tasks, if task takes more then 20 sec;
- Version 1.09c -
agent were changed in this release.
- new option: '
allowDiscardCoordinatorIP' - see hxgrid.ini for description;
- fixed bug:
hxGridUser was unable to run tasks, if single task size is larger than allowed memory usage for
- Version 1.09b - added method
- Version 1.09a - added Windows 2000 support;
- Version 1.09 - This is a very stable release with Windows Vista support.
- fixed critical bug: memory corruption in agent;
- fixed critical bug in scktcomp.pas Delphi component (thread-safety issue);
- fixed critical bug: agent is unable to load task dll in Windows Vista;
- fixed thread-safety bug in
Release() should use
- fixed memory leak in
IdUDPServer not freed);
IGridUser->SetSettings() .h headers;
IGridUser->isComplete(var complete:boolean) now returns
TRUE when no task is running (was:
- hxgriduseddll.dll updated in
PICalculator C++ example;
AutoUpgrade example: unable to upgrade due to relevance to current directory;
- fixed memory leak in
DebugWrite bug: exeption on invalid strings causes agent to lost connection;
- fixed possible thread-safety issue in
- new method:
- new method:
TAgentSettings in IAgent.h;
- IAgent.h and I_Agent.pas documented in English;
Normalmapper_hxGrid_Setup installs 3DS MAX and Maya NMF export plugins;
- Version 1.08a - article translated to English;
- headers are documented in English;
- cancellation support;
- fixed bug in
- fixed bug: freezing on 99.99%;
- fixed bug: lost tasks if
- fixed bug:
hxGrid application is unable to work, if workstation has not been rebooted more than 25 days;
- settings tuned for large
GridGMP sample (all drawing is done in main thread);
- network and memory overload tracking: at the start of the session, too many agents request data from user. Library tracks network and memory load;
- GridGMP example shows cancellation support and progress display;
Normalmapper installer installs NMF export plugins to plugin directories of 3DS MAX and Maya;
- more accurate progress display in ATI
NMFExport plugins for Maya 7.0, 8.0 and 8.5 included;
Note: agent and coordinator are modified, but have the same interace versions. Upgrade recommended (see Examples\Autoupgrade\);
- Version 1.08 beta - initial public release;