
Whole Program Optimization with Visual C++ .NET

By Brandon Bray (MSFT), 10 Dec 2001
 

Introduction

The charter of compiler optimization has always been to produce the fastest running programs possible. Developers trying to write performance-tuned programs are in a continuous endurance trial of writing code in ways that lead to optimization opportunities for the compiler. Historically, compilers have introduced scalar optimizations that work only on isolated pieces of a program, usually only inside functions. Visual C++ .NET goes to an entirely new level with Whole Program Optimization. This article discusses everything that the compiler can do using this new framework for optimization and how little the developer must do.

Before Visual C++ .NET

The first things that a person typically learns in a C++ course are using the compiler to compile code and understanding that the compiler is responsible for creating object files for each source file. The linker then brings all object files together into something useful, such as an executable or a DLL. From the beginning, the compiler is at a disadvantage - it can only see a small piece of the program at any point in time. Unable to see other pieces, the compiler must use a conservative approach that results in slowing a program down. A classic example of this is calling conventions.

The interfaces to each module of a program need to remain consistent. Common calling conventions are cdecl, stdcall, and fastcall. Mixing and matching calling conventions inside a program was possible, but it required the developer to annotate the function signature with keywords. The developer was not necessarily the best person to decide which calling convention would be best for each function. The compiler could not really make this decision either, however, without breaking the interface to other modules. There are many similar examples where programs could be improved if the compiler had access to the whole program. For example, inlining could only happen inside individual object files. This program makes two function calls that could have been inlined:

myclass.h
class MyClass {
private:
     int i;
public:
     void set_i(int n);
     void print_i();
};

 

myclass.cpp
#include <stdio.h>
#include "myclass.h"

void MyClass::set_i(int n) {
     i = n;
}

void MyClass::print_i() {
     printf("\"i\" is %i.\n", i);
}

 

main.cpp

#include "myclass.h"

int main(int argc, char* argv[])
{
     MyClass myclass;

     myclass.set_i(42);
     myclass.print_i();

     return 0;
}

Both set_i and print_i are inline candidates. Unfortunately, when the compiler is working on main.cpp, it does not have access to the implementation in myclass.cpp. Developers can work around this by putting inline candidates in the header file, but it is better coding practice to leave the header file free of implementation details. In addition, not every function that the developer marks with the inline keyword should actually be inlined. The reverse is also true: some functions that are not marked with the inline keyword are great inline candidates. Again, the compiler is always at a disadvantage because it does not have access to the entire program.
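The header-file workaround mentioned above can be sketched as follows. This is only an illustration, not code from the original sample: the accessor get_i and the helper store_and_read are hypothetical names added so that the effect can be observed.

```cpp
// Sketch of the pre-LTCG workaround: the member functions are defined in
// the header, so every source file that includes it sees their bodies.
#include <cstdio>

class MyClass {
private:
    int i;
public:
    // Defined inside the class body, so these are implicitly inline.
    void set_i(int n) { i = n; }
    int get_i() const { return i; }  // hypothetical accessor, for illustration only
    void print_i() const { std::printf("\"i\" is %i.\n", i); }
};

// Hypothetical caller: the compiler can inline both member calls here
// because their definitions are visible in this translation unit.
inline int store_and_read(int n) {
    MyClass myclass;
    myclass.set_i(n);
    return myclass.get_i();
}
```

The cost of this approach is the one the paragraph describes: the implementation details now live in the header and are recompiled by every file that includes it.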

With Visual C++ .NET

Link time code generation (LTCG), the Visual C++ .NET framework that makes whole program optimization possible, mitigates the difficulty a compiler has in performing optimizations. As the name implies, code generation does not occur until the linking stage. The steps that the compiler uses during an LTCG build can be summarized as follows:

  1. The compiler takes each source file and does the usual parsing and type checking. It then generates an intermediate representation of the source file. Without LTCG, that representation would be handed to the optimizer and the code generator.
  2. With LTCG, instead of optimizing the intermediate representation, the compiler puts it in the object file. Note that the compiler basically does nothing to the code. Instead of containing assembly language, the object file has a higher level view of the program.
  3. The linker now starts as usual trying to pull all the object files together to form a program. Because the object files do not contain assembly code, the linker must invoke the compiler to finish the job of compiling the code. The linker has the compiler optimize and generate code for one function at a time. The compiler can ask the linker for information about other parts of the program and thus make informed decisions rather than always assuming the worst case.

The linking stage will take longer than usual, but the compiling stage will be much faster. Also note that the object files produced by the compiler through LTCG are not as portable as object files that contain assembly code. The intermediate representation stored in LTCG object files is likely to change with each version of Visual C++, so these object files would need to be regenerated every time that the compiler is upgraded. This only matters when the developer is producing a .lib file for others to use. For that reason, unless the plan is to regenerate a new library for each future version of Visual C++, publicly distributing static-link libraries using LTCG for the object files is not recommended. Another consequence of including intermediate representations of the code in the object files, rather than assembly code, is that tools such as dumpbin.exe and editbin.exe do not work on these object files.

Optimizations Available to Whole Program Optimization

Cross-module inlining

As the previous example showed, cross-module inlining is perhaps the best reason to use whole program optimization. Instead of placing implementation details in header files, developers can now keep things neatly organized in an appropriate source file. It is not necessary to mark functions with the inline keyword, because the compiler can determine if it is beneficial to inline that function. This will happen when using the /Ob2 switch, which is implied by both /O1 and /O2. Sometimes, the release build in Visual Studio .NET will include /Ob1 on the command line; to enable cross-module inlining, do not include /Ob1, which only allows user-declared inline candidates to be inlined.

Cross-module bottom-up information

Often, the individual optimizations that the compiler can do are completely safe, but the information about the program is too conservative, and the compiler opts to not do the optimization in favor of accuracy. The compiler always generates information from the bottom of the call-tree. With whole program optimization, the scope of the information includes the entire program including information collected about each function’s register usage, memory usage, and information to improve inlining heuristics. With accurate information, the compiler does not need to make pessimistic decisions about whether a certain optimization is done.

Region based stack double alignment

Just as integers and pointers should be 4-byte aligned, doubles should be 8-byte aligned. By default, the stack in Win32 is 4-byte aligned. Misaligning data types results in significant performance loss. Without whole program optimization, the compiler has to generate code to dynamically align doubles on a per-function basis. Doing this is a challenge; the compiler cannot assume the position of the current stack frame. With whole program optimization, the compiler knows much more about the call-tree, and therefore, it can align the stack frame in a root function and keep things aligned through nested calls. Each function is not penalized with figuring the position of its stack frame.
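To make the alignment requirement concrete, the following sketch checks at run time whether a local double ended up on an 8-byte boundary. The names is_aligned, SumResult, and sum_and_check are hypothetical, and the result of the check depends on the platform and compiler settings; this only observes alignment and does not change how the compiler lays out the stack.

```cpp
#include <cstddef>
#include <cstdint>

// True when the address p is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical result type: the sum plus a flag telling whether the local
// accumulator happened to sit on an 8-byte boundary in this stack frame.
struct SumResult {
    double sum;
    bool accumulator_aligned;
};

inline SumResult sum_and_check(const double* values, std::size_t count) {
    double acc = 0.0;  // a stack-resident double
    for (std::size_t i = 0; i < count; ++i) {
        acc += values[i];
    }
    return { acc, is_aligned(&acc, 8) };
}
```

On a 32-bit build with a 4-byte aligned stack, accumulator_aligned may be false in some frames unless the compiler aligns the frame, which is the work that region based alignment moves to a root function.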

Custom calling convention

As previously mentioned, a single calling convention is not the best for every function. For example, functions passing only a few small arguments benefit greatly from fastcall, but using fastcall also strains the optimizer. The compiler is certainly a better judge of when to use a particular calling convention. With whole program optimization, the compiler knows about all the call sites for a particular function. This lets the compiler customize the calling convention. For example, function arguments could be passed through an available register rather than on the stack. Functions that are exposed outside the program, as would happen in a DLL, will necessarily retain their default calling convention.
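For comparison, this is roughly what choosing calling conventions by hand looks like. The macros DEMO_FASTCALL and DEMO_CDECL and both functions are hypothetical; the macros expand to the Visual C++ keywords only on 32-bit x86 builds and to nothing elsewhere, so the sketch compiles with other compilers too.

```cpp
// Hypothetical wrapper macros around the Visual C++ calling convention keywords.
#if defined(_MSC_VER) && defined(_M_IX86)
#  define DEMO_FASTCALL __fastcall
#  define DEMO_CDECL    __cdecl
#else
#  define DEMO_FASTCALL
#  define DEMO_CDECL
#endif

// Two small arguments: a plausible candidate for register passing.
int DEMO_FASTCALL add_small(int a, int b) {
    return a + b;
}

// Left on the default convention by the developer's choice.
int DEMO_CDECL sum_many(const int* values, int count) {
    int total = 0;
    for (int i = 0; i < count; ++i) {
        total += values[i];
    }
    return total;
}
```

With whole program optimization the developer would leave both functions unannotated and let the compiler decide, since it can see every call site.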

Improved memory disambiguation for non-address taken globals

Before whole program optimization, the compiler had a hard time optimizing global variables. Optimizing them is worthwhile because global variables live in memory and are highly susceptible to cache misses. Unfortunately, because it does not have access to the whole program, the compiler often must assume that global variables can be written to through an assignment to a pointer. With whole program optimization, the compiler and the linker can determine with better accuracy whether the address of a global variable is taken, so the compiler knows about every pointer to the global variable. If the variable cannot be changed through a pointer, it can be treated more like a local variable and opened to standard code optimizations.
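A small sketch of the aliasing problem follows. The global g_scale and both functions are hypothetical. In the first function, a compiler that sees only this file may have to assume that out could point at g_scale and reload the global on every iteration. The second function shows the hand-written equivalent of what the optimizer is free to do once it knows the address of g_scale is never taken.

```cpp
#include <cstddef>

// Hypothetical global with external linkage: other source files could,
// in principle, take its address.
int g_scale = 3;

// Conservative view: each store through `out` might modify g_scale.
void scale_all(int* out, const int* in, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = in[i] * g_scale;
    }
}

// What the optimizer may do when no pointer to g_scale exists:
// read the global once and keep it in a register-friendly local.
// (The two versions differ only if `out` actually aliases g_scale.)
void scale_all_hoisted(int* out, const int* in, std::size_t count) {
    const int scale = g_scale;
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = in[i] * scale;
    }
}
```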

Small TLS offset encoding

The x86 instruction set uses smaller instruction encodings when an offset is within 128 bytes of a pointer. When organizing the layout of variables in thread-local storage, it is better to place frequently used variables in the first 128 bytes of storage. The linker is the utility that organizes the layout for thread-local variables. Determining which variables are more frequently used requires knowing about the whole program. Knowing the position of the variables in thread-local storage allows the compiler to use a smaller instruction encoding for the variable offset. If a program is heavily threaded, whole program optimization could dramatically reduce the image size.
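As an illustration of the kind of data involved, the sketch below declares one small, frequently used thread-local variable and one large, rarely used one. All names are hypothetical, and the sketch uses the standard C++11 thread_local keyword as an assumption; code of the Visual C++ .NET era would have used __declspec(thread). The source does not control where the linker places each variable.

```cpp
#include <cstring>

// Hot: touched on every call, so a small offset into thread-local
// storage would give it the shorter instruction encoding.
thread_local int t_hit_count = 0;

// Cold: 256 bytes that are rarely used. Placed first, a buffer of this
// size would by itself push later variables past the first 128 bytes.
thread_local char t_last_error[256] = "";

int record_hit() {
    return ++t_hit_count;
}

void record_error(const char* message) {
    std::strncpy(t_last_error, message, sizeof(t_last_error) - 1);
    t_last_error[sizeof(t_last_error) - 1] = '\0';  // always terminate
}

const char* last_error() {
    return t_last_error;
}
```

Which variable is hot cannot be seen from the declarations alone; it depends on how often record_hit and record_error are called across the whole program.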

Using Whole Program Optimization

Fortunately, developers need to do very little to enable whole program optimization. On the command line, adding the /GL switch is all that is needed. When the /c switch is used to separate the compiling and linking stages, the linker will need the /LTCG switch if any object files were compiled with the /GL switch. When using the Visual Studio integrated development environment, enable whole program optimization by setting the Whole Program Optimization property in the General property page of the project properties’ Configuration Properties folder.
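A minimal sketch of a separate compile and link with these switches is shown in the comments below. The file names util.cpp, main.cpp, and app.exe and both functions are hypothetical; the code is shown as one listing, with comments marking which source file each function would live in.

```cpp
// Build sketch from a Visual C++ command prompt (file names are assumed):
//
//   cl /c /O2 /GL util.cpp main.cpp            compile only, with /GL
//   link /LTCG util.obj main.obj /OUT:app.exe  code generation happens here
//
#include <cstddef>

// Would live in util.cpp: tiny, and called in a loop from another file.
unsigned add_byte(unsigned sum, unsigned char byte) {
    return sum + byte;
}

// Would live in main.cpp: without /GL the call below stays a real call,
// because the body of add_byte is in a different object file.
unsigned checksum(const unsigned char* data, std::size_t size) {
    unsigned sum = 0;
    for (std::size_t i = 0; i < size; ++i) {
        sum = add_byte(sum, data[i]);
    }
    return sum;
}
```

Neither function needs the inline keyword or a header-resident body for the call to become a cross-module inlining candidate under /GL and /LTCG.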

Using whole program optimization restricts the ability to use other features of Visual C++ .NET. When compiling with the /GL switch, edit and continue (/ZI), automatic precompiled headers (/YX), and targeting the .NET common language runtime (/clr) are not available.

In real-world code, whole program optimizations have boosted performance as much as 10% to 15%. Of course, this can vary; some programs will benefit more than others. On x86 architectures, 3% to 5% improvement is common.

Common Questions About Whole Program Optimization

Can whole program optimization be used on some files, but not others?

Yes. Each source file that is compiled with the /GL switch produces an object file that will use whole program optimization. If an object file is not compiled with /GL, it will contain optimized assembly code using the traditional approach to compiling. Mixing object files built with and without /GL does not have any known issues.

Can I generate assembly files? What do they look like?

Assembly files (.asm) can be generated with LTCG, but because code generation is not done until link time, the assembly files are not produced until link time either. The .asm files produced with LTCG look just like those produced without LTCG, but they cannot be consumed by MASM.

What does this do to overall build time?

Overall build time does not change significantly. The shorter time in the compiling stage is shuffled to the linking stage, which now includes optimization and assembly code generation.

Conclusion

Link time code generation is a framework that enables whole program optimization. For developers, this means that the Visual C++ team is continuously examining even more ways to improve code through this framework. At the moment, whole program optimizations in Visual C++ .NET provide a significant advance toward making C and C++ programs the best that they can be.

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, AS TO THE INFORMATION IN THIS DOCUMENT.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

© 2001 Microsoft Corporation. All rights reserved.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Brandon Bray (MSFT)
Web Developer
United States
No Biography provided

Comments and Discussions

 
Warning on using flag | taumuon | 22-Mar-07 1:43
Hi,
 
We're getting an error trying to turn on the whole program optimisations on a project:
 
warning C4744: 'static unsigned long * tagVARIANT::* ATL::CVarTypeInfo::pmField' has different type in '..\Application\StdAfx.cpp' and '..\Application\SomeClass.cpp': 'int' and 'struct (4 bytes)'
 
Is there any workaround for this?
 
Cheers,
Gary
 
http://www.taumuon.co.uk/jabuka/
taumuon.blogspot.com

Re: Warning on using flag | Andrew Nielsen | 31-Jul-07 9:25
Remove the /GL flag from the compile of your code. This will prevent the linker from doing whole program optimization.

Re: Warning on using flag | taumuon | 4-Aug-07 7:54
We originally had the flag turned on, to gain performance.
However, when we made our application mixed-mode, we had to turn it off (As the article says, this flag doesn't work when targeting the .NET runtime with /clr).
 
This was for the project for the application's main executable. If we have the flag turned on for all of the other projects in the solution, will we benefit much?
 
Thanks!
 
http://www.taumuon.co.uk
taumuon.blogspot.com

Re: Warning on using flag | taumuon | 5-Aug-07 22:59
I knew I shouldn't have posted after being out in the sun too long!
 
What I meant to say, is that this article (http://msdn.microsoft.com/msdnmag/issues/05/01/COptimizations/) says that whole program optimisations should now work in VS 2005. And obviously, if we've got the feature available to improve performance then we'd like to use it.
 
I think the compiler's confused about how pmField is defined in the different headers, but I'm not sure how to cure this. Has anyone else come across this?
 
http://www.taumuon.co.uk
taumuon.blogspot.com

Re: Warning on using flag | hannahb | 11-Jun-08 15:26
I have come across it, but like most I do not have a solution. I am currently trying to see if my program runs properly even with the warnings.
 
We will see...

Re: Warning on using flag | xiaoyou | 26-Aug-11 16:51
It is said that mixing OBJs/LIBs is possible when applying the /GL flag: the non-GL objs contain actual assembly code, and the GL objs contain a specific IL (not the CLR's IL).
 
The linker can still go through and do whatever it can do to optimize?
 
We are also working on a mixed project, will try whole-program-optimization.

Intel code optimization compiler for MSVC 6 | FASTian | 2-Feb-05 1:43
Intel offers a code optimizer for both MSVC 6 and VC.NET. It's worth checking out at http://www.intel.com/software/products/compilers/cwin/index.htm

Linker-generated MAP file | srajan | 23-Jun-04 12:27
Brandon
 
Is there any Microsoft documentation that explains in sufficient detail how to read, interpret and use the information in the map file? Specifically, how much memory is being used by (a) different parts of the code in the program, and (b) statically allocated data. From a code optimization viewpoint, it helps to know where resources are being used (or misused).
 
Subby

How effective is it in practice? | Anonymous | 29-Oct-02 2:48
I would like to know if anyone has measured how much of an improvement it makes in a real commercial application. Has anyone tried it out?

IR format | Anonymous | 19-Feb-02 10:31
Where would one go about obtaining format specs for the IR that is placed in an object file (If one were interested in building a compliant compiler)?
 
Thanks in advance. :cool:

Re: IR format | Brandon Bray (MSFT) | 19-Feb-02 13:55
Hi there,
Unfortunately the IR has no specs -- that's because it changes on a nearly daily basis. This is why creating .lib files is not recommended, as the .obj files in the static link library will only work with the compiler that produced them. They have nowhere near the portability of .obj files that contain assembly.

Re: IR format | Anonymous | 20-Feb-02 5:28
It would be great if they could get to a point at which a minimal baseline standard could be set that the linker could handle.
 
It would then be hopefully trivial for any existing compiler to be adapted to generate toward that baseline IR.
 

Thanks for your quick answer :O

Re: IR format | Brian Ensink | 7-Apr-04 9:15
There is some excellent research being done in this area as well. Take a look at the open source LLVM compiler: http://llvm.cs.uiuc.edu. You will find LLVM's IR much more stable and very easy to work with.

Does this mean it supports the "export" keyword? | Anonymous | 8-Feb-02 15:13
I've been wanting "export" support for quite some time now. It'd really be nice not to have to stuff all my templates up in my headers (if for no other reason than faster compiles ;-) ).
 
--Chris

Either - or choice? | Michiel Salters | 17-Jan-02 4:15
There's no possibility to generate both representations for a single translation unit? That might be useful.

 
--
Michiel.Salters(@)cmg.nl

Does it help with templates | Philippe Mori | 14-Dec-01 4:15
Does this change mean that it would also be possible to put template source code in a source (CPP) file and have the linker generate the code?
 
It would be really great if the template source code could be put in source file as implementation details would not have to be shown in header in order to be able to generate the final code.
 
In fact, if we have a lot of template code, it could even help the compiler to perform better, as it would have to do the actual instantiation once instead of once for each source file... and would have fewer lines of code to read and process at compilation time.
 
Philippe Mori

Re: Does it help with templates | Brandon Bray (MSFT) | 17-Dec-01 13:33
Unfortunately, it does not mean template code can be put into the CPP files. Believe me -- I feel your pain. The problem is that the template specialization is done along with the parsing and type checking. All the template specialization happens before the code generator and optimizer even have a look at your program.
 
Fortunately, the only problem this creates is having to put implementation details in the header. Indeed, there is a slightly longer parsing time, but the parser, type checker, and template specialization are pretty darned fast (especially compared to everything else the optimizer does).
 
And luckily, whole program optimization might be able to speed up the code generation of template code. Without LTCG, the compiler creates each template specialization by putting it in the communal data section. That means if two object files have the same specialization, only one copy makes it into the final image produced by the linker. The concept is similar for whole program optimization -- specializations are in the communal data section, and now the compiler knows it only has to work on the code once.
 
I hope that answers your question to your satisfaction. If not, let me know!
 
Cheerio!
Brandon

Very informative | Andy Metcalfe | 13-Dec-01 22:07
A very informative article - thanks for the info. :cool:
 
Andy Metcalfe - Sonardyne International Ltd

Trouble with resource IDs? Try the Resource ID Organiser Add-In for Visual C++ 5.0/6.0

"I'm just another 'S' bend in the internet. A ton of stuff goes through my system, and some of the hairer, stickier and lumpier stuff sticks."
- Chris Maunder (I just couldn't let that one past ;) )

Code in data segment | Kristian Dupont | 11-Dec-01 21:39
Hi!
 
This sounds great, I am looking forward to trying it out. I have one question, though. In a large project I am working on, I make an "on-the-fly assembled" BitBlt function by writing code bytes into data memory and then calling a pointer to them. I do this because this function needs to fit a lot of different situations and optimizing for every scenario imaginable would make an enormous amount of code.
I know that this is a hack and to prevent failure, I always push/pop everything and I don't call anything from within my dynamic function.
So my question is how the new compiler/linker system would respond to this kind of code. I don't know much about compilers but I assume that it can tell that I am calling code in the data segment. It works perfectly with VS60 but I don't know what happens with the optimization around the call. To encapsulate it, I placed my call in a small (c++) function which I assumed would satisfy the VS60 optimizer.

Re: Code in data segment | Brandon Bray (MSFT) | 12-Dec-01 13:43
Hello Kristian,
As long as you are not calling functions from within these "on-the-fly" functions, whole program optimization should not break you. I hope that helps!
 
Cheerio!
Brandon

Last Updated 11 Dec 2001
Article Copyright 2001 by Brandon Bray (MSFT)