Codemasters is an award-winning game developer and publisher, with popular game brands like DiRT*, GRID*, Cricket*, and Operation Flashpoint*. With GRID 2, Codemasters wanted to deliver a compelling high-end experience on 4th generation Intel® Core™ processors even on low power Ultrabook™ systems. On top of that, GRID 2 includes power-friendly features to improve and extend the gaming experience when playing on the go with an Ultrabook device.
Codemasters collaborated with Intel to make the most of the wide range of performance options available in systems running 4th generation Intel Core processors. As a result, Codemasters shipped GRID 2 with fantastic visual quality, increased performance, and significant improvements in power management and mobile features. The game looks and runs its best on Ultrabook devices with 4th gen Intel Core processors. GRID 2 uses two advanced features that are only possible with the new Intel® Iris™ Graphics extension for pixel synchronization. With pixel synchronization, GRID 2 uses adaptive order-independent transparency (AOIT) on the game’s foliage and adaptive volumetric shadow mapping (AVSM) for efficient self-shadowing particles. With both features together, the GRID 2 game artists had greater control than ever to create an immersive world in the game. With GRID 2 running on PCs with 4th gen Intel Core processors, gamers have a high-performance experience that looks fantastic and plays great.
4th generation Intel Core processors bring big gains for GRID 2
With the introduction of 4th gen Intel Core processors, Intel delivered several technology advances that Codemasters had been looking for. With the processor’s advancements in graphics technology, improved CPU performance, and the Intel Iris Graphics extensions to DirectX* API, Codemasters had the basis for outstanding features and performance in GRID 2, as well as a strong collaboration with Intel.
Bringing a game like GRID 2 to life takes the work of many, but a few people at Codemasters and Intel were deeply involved in the GRID 2 co-development. Special thanks to Toby Evan-Jones (Producer), Robin Bradley (Lead Graphics Programmer), and Richard Kettlewell (Senior Graphics Programmer) at Codemasters, and Chris Seitz (Account Manager), Leigh Davies (Application Engineer) and Filip Strugar (Application Engineer) at Intel. Without them, this collaboration would not have happened.
With the graphics extensions, game developers have two new DirectX 11 extensions at their disposal, supported on 4th gen Intel Core processors.
- The first is the Intel Iris Graphics extension for instant access, which lets the graphics driver deliver a pointer to a location in GPU memory that can also be accessed directly by the CPU. Previously, accessing the GPU’s memory would have resulted in a copy, even though the CPU and GPU share the same physical memory.
- The second extension is the Intel Iris Graphics extension for pixel synchronization, which enables programmable blend operations. It provides a way to serialize and synchronize access to a pixel from multiple pixel shaders, and guarantee that pixel changes happen in a deterministic way. The serialization is limited to directly overlapping pixels, so performance remains unchanged for the rest of the code.
Since the extensions were new for 4th gen Intel Core processors, they hadn’t been used in a shipping game before, so we set out to learn the very best ways to use these extensions for GRID 2.
Codemasters was also interested in the power improvements that the 4th gen Intel Core brings to PC gaming. With longer battery life and better stand-by times, the Ultrabook platform makes an even more compelling gaming environment. Historically, Codemasters did not optimize for power efficiency. With GRID 2, they consistently deliver equivalent or better visuals, while using less power. GRID 2 players win, with longer play times on battery.
Getting it in tune: Foliage and particles needed help
Codemasters wanted GRID 2’s graphics to shine on Intel systems, but we had some challenges. To make the game as realistic as possible, we used a particle system for smoke and dust effects from the tires. The tire smoke originally cast a simple shadow on the track, but the smoke effect didn’t shadow itself and had no proper lighting. It relied on artist-created fake lighting, baked into the textures. For years, the artists at Codemasters have been asking for more realistic lighting for their particle systems, but the performance implications had always made it prohibitive. We knew there were better options, and the new processor has given them to us.
Figure 1. Smoke particles before optimization
In addition to the game’s signature city racing circuits, GRID 2 has several tracks that pass through countryside, featuring dense foliage along the track. This foliage needs to combine with complex lighting to make the racing environment’s atmosphere feel realistic and immersive. Foliage needs to use transparency along its edges to appear realistic and avoid pixel shimmer, especially on moving geometry. In order to render transparent geometry correctly, you must render it in a specific order, which can be impractical in complex real-time scenes like those in GRID 2. An alternative is to use Alpha to Coverage, but that requires multisample anti-aliasing (MSAA), which comes with a performance cost and still has artifacts compared to correct alpha blending.
Figure 2. Foliage before optimization, showing detail on the bottom image
Existing solutions to these challenges require a discrete graphics card, and often run brutally slow since they are very computationally heavy. Codemasters needed solutions that were as efficient as possible, and Intel delivered.
Pixel Synchronization: Bringing new performance to existing algorithms
Self-shadowing of particles and correct foliage rendering have one thing in common: both are problems that require data to be sorted during rendering. Shadows must be sorted with respect to the light source, and the foliage must be sorted relative to the viewer. One solution is to use DirectX 11 and unordered access views (UAVs). Because of limitations with the way atomic operations can be used, however, the algorithms either require unbounded memory or can result in visual artifacts when memory limits are reached. UAVs also can’t guarantee that each frame will access pixels in the same order, so some sorting is required in order to prevent visual artifacts between frames.
The Intel Iris Graphics extension for pixel synchronization gives graphics programmers new flexibility and control over the way that the 3D rendering pipeline executes pixel shaders. Intel researchers used this capability to design algorithms that solve three long-standing problems in real-time graphics:
- Order-independent transparency
- Anti-aliasing of complex scene elements such as hair, leaves, and fences
- Shadows from transparent effects such as smoke
Unlike previous approaches, Intel’s algorithms with pixel synchronization use a constant amount of memory, perform well, and are robust enough for game artists to intuitively use them in a wide range of game scenes. Because pixel synchronization also guarantees any changes to the UAV contents are always ordered by primitive, they’re consistent between frames. This means that games can now use order-dependent algorithms. Intel published earlier versions of these algorithms in the graphics literature two to three years ago, but they have not been practical to deploy in-game until the advent of pixel synchronization on 4th gen Intel Core processors. The published algorithms are called adaptive order-independent transparency (AOIT) and adaptive volumetric shadow maps (AVSM).
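To see why ordering by primitive matters, consider ordinary alpha blending, whose result depends on the order in which fragments reach a pixel. The following C++ sketch is a CPU-side illustration only. `Fragment` and `blendInOrder` are hypothetical names, and the real extension is used from pixel shaders rather than through any call shown here.

```cpp
#include <cassert>
#include <vector>

struct Fragment {
    float color;  // single channel, for brevity
    float alpha;  // opacity in [0, 1]
};

// Standard "over" blending, applied in the order the fragments arrive.
// If the arrival order changes between frames, so does the result.
float blendInOrder(float background, const std::vector<Fragment>& fragments) {
    float result = background;
    for (const Fragment& f : fragments) {
        assert(f.alpha >= 0.0f && f.alpha <= 1.0f);
        result = f.color * f.alpha + result * (1.0f - f.alpha);
    }
    return result;
}
```

Blending the same two half-transparent fragments in opposite orders gives different pixel values, which is the kind of frame-to-frame variation a guaranteed ordering avoids.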
Smoke particle shadow and lighting: Using pixel synchronization for AVSM
The smoke particle effects are central in GRID 2, so this was an obvious place to apply AVSM. With this feature added, the smoke particles realistically cast shadows on themselves and the track. Artists have greater control over how the particles are lit and shadow themselves, so they have great visual impact.
"The artists working on 'GRID 2' have been requesting this type of effect for years, and prior to this, it wasn't possible to achieve it at a reasonable cost," said Clive Moody, senior executive producer at Codemasters Racing*. "The fact that this capability will be available to millions of consumers on forthcoming 4th generation Intel Core processors is very exciting to us."
A PC-only particle system showcases this result.
Figure 3. Smoke particles with AVSM, showing self-shadowing
Because AVSM combines transparent results in a space-efficient way, there is some compression. You might think that AVSM could introduce unacceptable compression errors, but in practice, visual quality is very good. More importantly, the effect is deterministic since the pixel synchronization ensures pixels are committed in the same order on each frame. This avoids problems with shimmering and flickering that can be introduced by related techniques.
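The compression step can be pictured as keeping a transmittance curve within a fixed node budget. The sketch below is a simplified CPU-side C++ illustration, not the GRID 2 shader code: it assumes a piecewise-linear curve, and `Node`, `removalCost`, and `insertAndCompress` are hypothetical names. It covers only the "drop the least important node" idea and omits how a new particle attenuates the nodes behind it.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Node {
    float depth;          // distance from the light
    float transmittance;  // fraction of light remaining at that depth
};

// Area of the triangle (a, b, c): how much the area under the curve
// changes if the middle node b is dropped.
float removalCost(const Node& a, const Node& b, const Node& c) {
    return 0.5f * std::fabs((b.depth - a.depth) * (c.transmittance - a.transmittance) -
                            (c.depth - a.depth) * (b.transmittance - a.transmittance));
}

// Insert a node in depth order, then drop the cheapest interior node
// if the curve exceeds its fixed budget. Memory use stays constant.
void insertAndCompress(std::vector<Node>& curve, Node n, std::size_t maxNodes) {
    assert(maxNodes >= 3);
    auto pos = std::upper_bound(curve.begin(), curve.end(), n,
        [](const Node& lhs, const Node& rhs) { return lhs.depth < rhs.depth; });
    curve.insert(pos, n);
    if (curve.size() <= maxNodes) return;

    std::size_t best = 1;
    float bestCost = removalCost(curve[0], curve[1], curve[2]);
    for (std::size_t i = 2; i + 1 < curve.size(); ++i) {
        float cost = removalCost(curve[i - 1], curve[i], curve[i + 1]);
        if (cost < bestCost) {
            bestCost = cost;
            best = i;
        }
    }
    curve.erase(curve.begin() + static_cast<std::ptrdiff_t>(best));
}
```

Because the same inputs in the same order always remove the same node, the compressed curve is stable from frame to frame when fragment order is guaranteed.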
The first implementation of AVSM in GRID 2 used 8 nodes, and performed all lighting calculations on a per-pixel level using the resolution of the current particle system (normally smaller than the actual screen size). Bilinear sampling smoothed out artifacts when viewing a stationary smoke plume in a replay camera. This first implementation was fast enough in game on higher-end systems with Iris Pro Graphics, but with cars having multiple emitters (4+ per car), it took 8 ms to create a shadow map and up to 18 ms to resolve for each. With four emitters, that is roughly 4 × (8 + 18) = 104 ms, a worst case of about 100 ms per frame for adding AVSM, so improvements were needed if this feature was to be enabled by default.
The AVSM node itself was improved, so that 4 nodes could be used instead of 8 with no noticeable visual change. On top of that, a major improvement in performance and quality came from adding vertex shader tessellation, with per-vertex lighting. This avoids sampling the AVSM data structures at a more expensive per-pixel level. GRID 2 implements screen space tessellation in the domain shader and then uses faster per-vertex lighting evaluation to sample the shadow map. By using screen space tessellation, we ensure that large particle quads near the front of the screen are broken down into smaller triangles, while small or distant particles are left relatively untouched. The results are nearly identical visually, and performance is improved, especially for the worst-case scenarios such as replaying while focusing on the car doing a wheel spin.
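One way to choose how finely to split a particle quad is to estimate its on-screen size and aim for a target triangle edge length in pixels. The C++ sketch below shows that calculation on the CPU for clarity. The function name, its parameters, and the target edge length are assumptions for illustration, not values from GRID 2. In a DirectX 11 pipeline this logic would typically live in the hull shader's patch constant function.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Estimate a tessellation factor from a quad's projected size.
// projScale is assumed to be 1 / tan(verticalFov / 2).
float screenSpaceTessFactor(float worldSize, float viewDepth, float projScale,
                            float viewportHeightPx, float targetEdgePx) {
    assert(viewDepth > 0.0f && targetEdgePx > 0.0f);
    // Perspective projection: on-screen size shrinks with distance.
    float edgePx = worldSize * projScale / viewDepth * 0.5f * viewportHeightPx;
    float factor = std::ceil(edgePx / targetEdgePx);
    // Direct3D 11 limits tessellation factors to the range [1, 64].
    return std::min(std::max(factor, 1.0f), 64.0f);
}
```

Large quads close to the camera get a high factor, while small or distant quads stay at 1 and are left untouched.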
Once particle self-shadowing was added, it became clear that the individual particles weren’t sorted correctly when drawn on the screen. Originally, the game had sorted particles back-to-front within an emitter, so the transparent particles would render correctly. With multiple emitters per car, however, it was possible for far smoke plumes to be drawn on top of near ones.
Figure 4. Problem - unsorted smoke particles with AVSM, with far smoke plumes on top of near ones
This wasn’t a problem before because the original art was uniform. At first, we planned to solve this with pixel synchronization. We created a working version of the AOIT algorithm (described below) to do this, but since the particles are all screen-space-aligned, they can simply be sorted on the CPU instead. This was faster than a pixel synchronization solution, since it used spare performance on the CPU.
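A minimal C++ sketch of that CPU-side fix: gather the particles from every emitter into one list and sort the whole list back-to-front, instead of sorting within each emitter. `Particle`, its fields, and `mergeBackToFront` are hypothetical names. The actual GRID 2 data layout is not described here.

```cpp
#include <algorithm>
#include <vector>

struct Particle {
    float viewDepth;  // distance from the camera along the view direction
    int emitterId;    // which emitter spawned the particle
};

// Merge all emitters and sort farthest-first, so nearer particles are
// drawn later and blend on top of farther ones.
std::vector<Particle> mergeBackToFront(const std::vector<std::vector<Particle>>& emitters) {
    std::vector<Particle> all;
    for (const auto& emitter : emitters) {
        all.insert(all.end(), emitter.begin(), emitter.end());
    }
    std::stable_sort(all.begin(), all.end(),
        [](const Particle& a, const Particle& b) { return a.viewDepth > b.viewDepth; });
    return all;
}
```

A single depth value per particle is enough here because the quads are screen-space-aligned and so do not intersect in ways that need per-pixel sorting.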
The final piece of the lighting puzzle was to integrate the AVSM shadow system with Beast* lighting from Autodesk. Beast lighting is used to light the rest of the geometry, which means the AVSM shadow map must pick up the recalculated lighting data, so that smoke trails will darken under bridges or pick up light sources around the edge of the track.
While AVSM still has a run-time cost, after optimizations it was well within the budget for visual impact. The worst-case scenario was sped up almost 4x. Typical performance is about 0.7 ms per shadow cascade with a 0.4 ms resolve stage, using about 200K pixels on a quarter screen render target. AVSM is enabled by default on high presets; the algorithm can also be switched off and on with the Advanced Settings menu on any 4th gen Intel Core processor-based system.
Foliage transparency: Using pixel synchronization for AOIT
Codemasters’ racing titles have a long history of attractive outdoors scenery, with the DiRT franchise pushing artistic boundaries creating realistic off-road environments. While GRID 2 doesn’t go off-road, there are still plenty of tracks that show off stunning point-to-point circuits.
Figure 5. The Great Outdoors, showing off the stunning scenery
Codemasters wanted their artists’ work to shine. Transparency on the foliage edges is one part of creating a realistic look and feel. Originally, the only way to get soft edges was to use Alpha to Coverage with high levels of MSAA enabled. This ran very slowly, and Alpha to Coverage doesn’t provide depth to densely packed trees. Codemasters turned to AOIT to get the transparent edges of the foliage looking their best, while also running faster and improving the look of the dense forest sections. No changes were required to the art pipeline.
Figure 6. Foliage with AOIT, showing soft edges in the detail on the bottom image
It took about 5 ms to render the trees in an area of the track with heavy foliage, which was a significant chunk of a frame. When it was first implemented, AOIT pushed that to 11 ms. This approached the time to run MSAA, so this was too long. Optimizations reduced this significantly.
The initial AOIT implementation used 4 nodes to store the transparency information. It also used a complex compression routine (similar to the one used for AVSM) that took into account the difference in area beneath a visibility graph. Experiments showed that for typical scenes sorted relative to the viewer, a much simpler algorithm could be used since the depth played a smaller part in the visibility decision. Further experiments showed that 2 nodes were enough to store that data. This allowed both color and depth information to be packed into a single 128-bit structure, rather than separate color and depth surfaces. AOIT’s performance was further improved by using a tiled access pattern to swizzle the elements of the UAV data structure, making memory access more cache-friendly. In total, this nearly doubled the performance of AOIT, bringing it down to 2-3 ms on complex foliage heavy scenes and much less on scenes with light foliage.
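A tiled access pattern can be sketched as an address calculation that stores each small square block of pixels contiguously, so neighboring pixels tend to share cache lines. The C++ function below is an illustration under assumed parameters. The tile size and exact layout used in GRID 2 are not stated, and `tiledIndex` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Map pixel (x, y) to a linear element index such that every
// tileSize x tileSize block of pixels occupies a contiguous range.
std::uint32_t tiledIndex(std::uint32_t x, std::uint32_t y,
                         std::uint32_t width, std::uint32_t tileSize) {
    assert(tileSize > 0 && width % tileSize == 0 && x < width);
    std::uint32_t tilesPerRow = width / tileSize;
    std::uint32_t tileX = x / tileSize;
    std::uint32_t tileY = y / tileSize;
    std::uint32_t inTileX = x % tileSize;
    std::uint32_t inTileY = y % tileSize;
    std::uint32_t tile = tileY * tilesPerRow + tileX;
    return tile * tileSize * tileSize + inTileY * tileSize + inTileX;
}
```

With a row-major layout, vertically adjacent pixels are a full row apart in memory. With this layout they are only `tileSize` elements apart when they fall in the same tile.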
While AOIT proved a good solution for the complex foliage, it still presented some issues. Ideally, all transparent objects would get rendered with the same AOIT path. This would have been expensive since some transparent objects like god rays were already alpha-blended to a large part of the screen and rendered with a traditional back-to-front pass. Combining the two techniques initially created draw-order problems, since it’s difficult to combine traditional back-to-front transparency rendering with AOIT.
We wanted to keep the efficiency of the back-to-front render for objects that could easily be sorted, while gaining the flexibility of using AOIT on complex intersecting geometry. The solution turned out to be fairly elegant. First, render AOIT without resolving to the screen. Then, execute a back-to-front traditional pass of transparent objects. Anywhere a traditionally rendered object interacted with a screen-space pixel from the AOIT pass, that object was added to the AOIT buffer instead of being rendered. Finally, they’re all resolved. This approach works great, as long as the AOIT objects don’t cover a large part of the screen at the same time as a standard object. This approach allowed ground coverage and god-rays to correctly interact with the tree foliage with only a minimal performance impact. In the end, the AOIT became so efficient it was added to other objects that suffered from aliasing, such as the chain link fences. This allowed for thin geometry to fade out into the distance gracefully, rather than becoming noisy and aliased.
Figure 7. Fences on the left show aliasing in the distance, AOIT improves fences on the bottom image
At first, AOIT didn’t work right when MSAA was also enabled. AOIT needs to account for pixels rendered at higher sample frequency at triangle edges. It’s not enough to simply add partially covered pixels into the AOIT buffer with a lower alpha value since they won’t blend properly. These pixels have to be handled separately, adding to the time to compute them. Otherwise, they can reinforce each other and give a double darkening around edges. The solution for GRID 2 was to do this partially, to get the right balance between correctness and compute time.
AOIT is enabled at Medium quality settings and above, and it can be switched off and on with the Advanced Settings menu. GRID 2 uses Medium quality settings by default on all 4th gen Intel Core processors.
Instant access: Lessons learned
The 4th generation Intel Core processors brought two new extensions to DX11 graphics. Pixel synchronization was heavily used in GRID 2. What about instant access?
Instant access provides access to resources in memory shared by the CPU and GPU. Since GRID 2 already used direct memory access on the consoles, at first we assumed it would be easy to also use on the PC. Systems such as particles, ground cover, crowd instance data, and crowd camera flashes all accessed the vertex data. Instead of giving an immediate speedup, instant access actually introduced stalling in the render pipeline. DirectX was still honoring the buffer usage and would wait to unlock the resource if it was already in flight to the graphics engine.
We could have added manual double-buffering to work around this, but we realized that the driver was already doing a good job optimizing its usage on the linearly-addressed memory, so we weren’t likely to see a large speedup. As a result, instant access wasn’t used in GRID 2.
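For readers who want to try the manual double-buffering workaround mentioned above, which GRID 2 did not ship, the idea can be sketched in C++ as two copies of the data that swap roles each frame. `DoubleBufferedVertices` is a hypothetical class using plain CPU memory. A real version would wrap two GPU-visible resources, and no extension API calls are shown.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

class DoubleBufferedVertices {
public:
    explicit DoubleBufferedVertices(std::size_t floatCount)
        : buffers_{std::vector<float>(floatCount), std::vector<float>(floatCount)} {
        assert(floatCount > 0);
    }

    // Buffer the CPU may fill this frame without waiting on the GPU.
    std::vector<float>& writeBuffer() { return buffers_[writeIndex_]; }

    // Buffer submitted last frame, which the GPU may still be reading.
    const std::vector<float>& inFlightBuffer() const { return buffers_[1 - writeIndex_]; }

    // Call once per frame, after submitting the draw that uses writeBuffer().
    void endFrame() { writeIndex_ = 1 - writeIndex_; }

private:
    std::array<std::vector<float>, 2> buffers_;
    std::size_t writeIndex_ = 0;
};
```

The cost is twice the memory and the need to rewrite the whole buffer each frame, which is part of why a small expected speedup may not justify it.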
We talked about a few ideas that could have given performance boosts, like using instant access for texture memory. GRID 2 doesn’t stream the track data, and only a small number of videos are uploaded during a race, so we didn’t expect a large gain. After that, we focused our attention on pixel synchronization since we had such obvious benefits from that extension in this game.
Your game may take advantage of instant access in several ways. Instant access might give faster texture updates from the CPU (working on native tiled formats), since your game will avoid the multiple writes that come from reordering data for the driver. Or you may find major gains accessing your geometry if you have a lot of static vertex geometry with small subresource updates per frame.
Try it out, and see!
Anti-aliasing: Big improvements
Anti-aliasing helps games look great. Multi-sample anti-aliasing (MSAA) is commonly used and supported by Intel graphics hardware, but it can be expensive to compute. Since GRID 2 has a very high standard for visual quality and run-time performance, we weren’t satisfied with performance trade-offs for enabling MSAA, especially on Ultrabook systems with limited power budgets. Together, Intel and Codemasters incorporated a technique we’ll call conservative morphological AA (CMAA).
While you should look for full details on CMAA in an upcoming article and sample, we’ll outline the basics. As a post-process AA technique, it’s similar to morphological AA (MLAA) or subpixel morphological AA (SMAA). It runs on the GPU and has been tailored for low bandwidth, with about 55-75% of the run-time cost of 1xSMAA. CMAA approaches the quality of 2xMSAA for a fraction of the cost. It does have some limited temporal artifacts, but looks slightly better on still images.
For comparison, at 1600x900 resolution with High quality settings, enabling 2xMSAA adds 5.0 ms to the frame, but CMAA adds only 1.5 ms to the frame (at a frame rate of 38.5 FPS). CMAA is a great alternative for gamers who want a nicely anti-aliased look but don’t like the performance of MSAA.
Figure 8. Original garage on the left shows some aliasing, better with CMAA applied on the bottom image.
Because CMAA is a post-processing technique, it also works well in conjunction with AOIT, without suffering from the sampling frequency issues discussed above.
SSAO: A study in contrasts
GRID 2 contains screen-space ambient occlusion (SSAO) code that runs great on some hardware, but didn’t run as well as we’d like on Intel® hardware. There are different SSAO techniques, and GRID 2 originally used high definition ambient occlusion (HDAO). When we first studied it, it took 15-20% of the frame, which was far too much.
The original SSAO algorithm uses compute shaders, but CS algorithms can sometimes be tricky to optimize for all variations of hardware. We worked together to create a pixel shader implementation of SSAO that performs better in more cases.
Figure 9. SSAO turned off on the left, SSAO turned on and running in a pixel shader on the bottom image.
The CS implementation relies heavily on texture reads/writes. The PS implementation uses more computation than texture reads/writes, so it doesn’t use as much memory bandwidth as the CS implementation. As a result, the PS version of SSAO runs faster on all hardware we tested and runs significantly faster on Intel graphics hardware. While the new version is the default, you may choose either SSAO implementation from the configuration options.
Looks great, less battery: Minding the power gap
More gamers than ever play on the go. This poses some special challenges for game developers. To help players keep an eye on their charge while playing, GRID 2 displays a battery meter on-screen. Codemasters used the Intel® Laptop and Netbook Gaming Technology Development Kit to check the platform’s current power level and estimated remaining battery time. When you’re running on battery power, that information is discreetly shown as a battery meter in the corner of the screen.
When playing on battery, the CPU and GPU workloads each contribute to the overall power use. This makes it a careful balancing act to optimize for power since changes to one area may affect the power use of the other.
First, we optimized any areas where extra work was being done on the CPU that didn’t affect the GPU. For example, there were some routines that converted back and forth between 16-bit floats and 32-bit floats. Those routines used simple reference code, but after study, we replaced them with a different version that ran much faster.
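For context, this is what a straightforward 16-bit to 32-bit float conversion involves. The C++ sketch below is a portable, reference-style routine written for illustration. It is not the GRID 2 code, and the faster replacement Codemasters used is not described in this article. `halfToFloat` is a hypothetical name.

```cpp
#include <cstdint>
#include <cstring>

// Convert an IEEE 754 half-precision value (as raw bits) to a float.
float halfToFloat(std::uint16_t h) {
    std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
    std::uint32_t exponent = (h >> 10) & 0x1Fu;
    std::uint32_t mantissa = h & 0x3FFu;
    std::uint32_t bits;

    if (exponent == 0) {
        if (mantissa == 0) {
            bits = sign;  // signed zero
        } else {
            // Subnormal half: shift until the implicit leading 1 appears.
            std::uint32_t shift = 0;
            while ((mantissa & 0x400u) == 0) {
                mantissa <<= 1;
                ++shift;
            }
            mantissa &= 0x3FFu;
            bits = sign | ((113u - shift) << 23) | (mantissa << 13);
        }
    } else if (exponent == 0x1Fu) {
        bits = sign | 0x7F800000u | (mantissa << 13);  // infinity or NaN
    } else {
        // Normal number: rebias the exponent from 15 to 127.
        bits = sign | ((exponent + 112u) << 23) | (mantissa << 13);
    }

    float result;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}
```

Routines like this run per element, so the branches and the loop add up when converting large buffers every frame.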
Another CPU power optimization came from the original use of spin locks for thread synchronization. This is very power inefficient; it keeps one CPU core running at full frequency, so the CPU’s power management features cannot reduce the CPU frequency to save power. It can also prevent the operating system’s thread scheduler from making the best thread assignment. Several parallel job systems were rewritten, including the CPU-side particle code. They were changed to reduce the amount of cross-thread synchronization.
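One common alternative to a spin lock is to let a waiting thread sleep until work arrives. The C++ sketch below uses the standard `std::mutex` and `std::condition_variable`. `JobQueue` is a hypothetical class for illustration, and the article does not say which primitives the rewritten GRID 2 job systems use.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

class JobQueue {
public:
    void push(int job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(job);
        }
        ready_.notify_one();  // wake one sleeping worker, if any
    }

    // Blocks until a job is available. While blocked, the thread sleeps
    // instead of spinning, so the core can drop to a lower power state.
    int waitAndPop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !jobs_.empty(); });
        int job = jobs_.front();
        jobs_.pop();
        return job;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<int> jobs_;
};
```

A worker thread would call `waitAndPop()` in a loop. A blocking wait also gives the operating system's scheduler a chance to place threads sensibly, which a busy loop does not.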
One of the best power optimizations that can be done on a mobile platform is to lock the frame rate to a fixed interval. This lets both the CPU and GPU enter a lower power state between frames. Since GRID 2 was already optimized around a target of 30 FPS on default settings, it wouldn’t have had much effect if we had simply set a 30 FPS frame rate cap. Instead, there’s a special mode added to the front-end options. If power saving is enabled, the game will reduce some visual quality settings when the user is running on battery. Since none of the setting changes require a mode change, they can happen seamlessly during play. These changes raise the average frame rate above 30 FPS, so a 30 FPS frame rate cap is now effective at saving power and prolonging game play on battery.
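The reasoning can be made concrete with a small C++ sketch of a frame cap. The function names are hypothetical, and a shipping game would more likely pace frames through its presentation or vsync settings than through a plain sleep.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// Time left in the frame budget after `elapsed` of work, or zero if the
// frame already took longer than the budget.
std::chrono::microseconds idleTimeForFrame(std::chrono::microseconds elapsed, int targetFps) {
    assert(targetFps > 0);
    const std::chrono::microseconds budget(1000000 / targetFps);
    return elapsed < budget ? budget - elapsed : std::chrono::microseconds(0);
}

// Call at the end of each frame: sleep for whatever budget remains so the
// CPU and GPU can idle until the next frame is due.
void capFrameRate(Clock::time_point frameStart, int targetFps) {
    auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(
        Clock::now() - frameStart);
    std::this_thread::sleep_for(idleTimeForFrame(elapsed, targetFps));
}
```

At a 30 FPS cap the budget is about 33 ms. A frame that already takes 33 ms or more leaves no idle time, which is why lowering quality settings first is what makes the cap save power.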
Finally, the game’s built-in benchmark now uses power information. When profiling the game over a single run, GRID 2 logs power and battery information as the benchmark loops. If you study these results over time, you can see how power-efficient your current settings are on your benchmark system.
Working together, Intel and Codemasters found ways to deliver a fantastic game that looks and runs great on Intel’s latest platforms.
Now that they can be built on top of pixel synchronization, AVSM and AOIT bring new levels of visual impact along with great performance. Together, they enrich the game environment and give a greater level of immersion than ever before.
The addition of CMAA brings a new option for high-performance visual quality. Moving SSAO to a pixel shader helps the game run faster. After optimizing usage of the DirectX API with more efficient state caching, optimizing float conversion routines, removing spin locks, and automatically adjusting quality settings and capping the frame rate, the game gets the most out of your battery. GRID 2 also helps gamers keep track of their battery power when they’re playing on the go.
Adding those together, GRID 2 looks and runs great on Intel’s latest platforms. Consider the same changes in your game!
Latest AVSM paper and sample: http://software.intel.com/en-us/blogs/2013/03/27/adaptive-volumetric-shadow-maps
Original AVSM paper and sample: http://software.intel.com/en-us/articles/adaptive-volumetric-shadow-maps
AOIT paper and sample: http://software.intel.com/en-us/articles/adaptive-transparency
Laptop and Netbook Gaming TDK Release 2.1: http://software.intel.com/en-us/articles/intel-laptop-gaming-technology-development-kit
4th Generation Intel® Core™ Processor Graphics Developer Guide: http://software.intel.com/en-us/articles/intel-graphics-developers-guides
About the author
Paul Lindberg is a Senior Software Engineer in Developer Relations at Intel. He helps game developers all over the world to ship kick-ass games and other apps that shine on Intel platforms.
Intel, the Intel logo, Core, and Ultrabook are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2013 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.