Generally speaking, performance optimization with the Intel® C++ Compiler involves six steps:
- Compilation without optimization
- Enable general optimization
- Enable processor-specific optimization
- Use IPO optimization
- Use PGO optimization
- Tune Auto-Vectorization
1. Compilation without optimization
This is the starting point. Performance tuning must be based on a correctly working application, so verify the application's correctness before you start tuning. Normally the -O0 option is used in this step because it makes debugging easier. If your application has already been verified with either the GNU* compiler or the Intel® C++ Compiler, you can move on to the next step.
2. Enable general optimization
This step enables the most common compiler optimizations. Several options are available; which one to select depends on your application scenario.
-O1 and -Os (Linux*) or /O1 and /Os (Windows*)
This option enables optimizations for speed and disables some optimizations that increase code size and affect speed. To limit code size, this option enables global optimization, which includes data-flow analysis, code motion, strength reduction and test replacement, split-lifetime analysis, and instruction scheduling. This option also disables inlining of some intrinsics. If -O1 is specified, the -Os option is also enabled by default; -Os focuses on optimizations that do not increase code size. With the O1 option, the compiler's auto-vectorization functionality is disabled.
-O2 (Linux*) or /O2 (Windows*)
This option enables optimizations for code speed. This is the generally recommended optimization level. Compiler vectorization is enabled at O2 and higher levels. With this option, the compiler performs some basic loop optimizations, inlining of intrinsics, intra-file interprocedural optimization, and the most common compiler optimizations.
-O3 (Linux*) or /O3 (Windows*)
This option performs O2 optimizations and enables more aggressive loop transformations such as fusion, block-unroll-and-jam, and collapsing IF statements. The O3 optimizations may not improve performance unless loop and memory access transformations take place, and in some cases they may slow code down compared to O2. The O3 option is recommended for applications that have loops that make heavy use of floating-point calculations and process large data sets.
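To make loop fusion concrete, the sketch below fuses two loops by hand. It only illustrates the kind of transformation named above; it is not the compiler's actual output, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two separate passes: `a` is written by the first loop and read back
// from memory by the second loop.
void scale_then_offset_separate(std::vector<float>& a, std::vector<float>& b,
                                const std::vector<float>& src) {
    assert(a.size() == src.size() && b.size() == src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        a[i] = 2.0f * src[i];
    }
    for (std::size_t i = 0; i < src.size(); ++i) {
        b[i] = a[i] + 1.0f;
    }
}

// Hand-fused version: one pass over the data, so a[i] is still at hand
// when b[i] is computed. The results are identical to the version above.
void scale_then_offset_fused(std::vector<float>& a, std::vector<float>& b,
                             const std::vector<float>& src) {
    assert(a.size() == src.size() && b.size() == src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        a[i] = 2.0f * src[i];
        b[i] = a[i] + 1.0f;
    }
}
```

Whether fusion pays off depends on the data size and memory behavior, which is why it is worth measuring both O2 and O3 builds.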
-no-prec-div (Linux*) or /Qprec-div- (Windows*)
This option enables optimizations that give slightly less precise results than full IEEE division. In this case, the compiler may change floating-point division computations into multiplication by the reciprocal of the denominator. For example, A/B is computed as A * (1/B) to improve the speed of the computation. If your application can tolerate results that are slightly less precise than full IEEE division, you may enable this option.
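The effect of the A * (1/B) rewrite can be checked by hand. The following is a minimal sketch with hypothetical helper names; it performs the rewrite explicitly in source code so that the rounding difference can be measured, and it does not show what the compiler itself emits.

```cpp
#include <cassert>
#include <cmath>

// Full-precision division: a single correctly rounded operation.
double divide_exact(double a, double b) {
    assert(b != 0.0);
    return a / b;
}

// The rewrite described above: one reciprocal, then one multiplication.
// Each of the two operations rounds, so the result may differ from
// divide_exact in the last bits.
double divide_via_reciprocal(double a, double b) {
    assert(b != 0.0);
    const double reciprocal = 1.0 / b;
    return a * reciprocal;
}

// Relative difference between the two results (a must be nonzero so that
// the exact quotient is nonzero).
double relative_difference(double a, double b) {
    assert(a != 0.0);
    const double exact = divide_exact(a, b);
    return std::fabs(divide_via_reciprocal(a, b) - exact) / std::fabs(exact);
}
```

Running relative_difference over representative inputs from your application gives a rough idea of whether the reduced precision is acceptable.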
-ansi-alias (Linux*) or /Qansi-alias (Windows*)
This option tells the compiler to assume that the program adheres to the ISO C Standard aliasability rules. If your program adheres to these rules, this option allows the compiler to optimize more aggressively. If it does not, the compiler may generate incorrect code. Enable this option only when your program adheres to the ISO C Standard aliasability rules.
A common tip for this step: compile your application with both the O2 and O3 options, and use whichever provides higher performance.
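As an illustration of what adhering to the aliasing rules means in practice, the sketch below reads the bit pattern of a float without accessing it through an unrelated pointer type. The function names are hypothetical, and the self-check assumes the platform uses the IEEE 754 32-bit format for float.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Non-conforming pattern, shown only as a comment because its behavior is
// undefined under the aliasing rules:
//   std::uint32_t bits = *reinterpret_cast<std::uint32_t*>(&value);

// Conforming alternative: copy the bytes. Compilers typically turn this
// small fixed-size memcpy into a plain register move.
std::uint32_t float_bits(float value) {
    static_assert(sizeof(std::uint32_t) == sizeof(float),
                  "float is expected to be 32 bits wide");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// Sanity check; assumes IEEE 754 binary32, where 1.0f is 0x3F800000.
void check_ieee754_layout() {
    assert(float_bits(1.0f) == 0x3F800000u);
}
```

Code that relies on the commented-out cast is the kind of code that can break once the compiler is told to assume the aliasing rules hold.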
3. Enable processor-specific optimization
If you have a specific target processor for your application, you can use the option -x<code> (Linux*) or /Qx<code> (Windows*) to enable processor-specific optimizations. This option tells the compiler which processor features it may target, including which instruction sets and optimizations it may generate. It also enables optimizations in addition to Intel® feature-specific optimizations.
The specialized code generated by this option may run only on a subset of Intel® processors: the resulting executables can be run only on Intel® processors that support the indicated instruction set.
For most embedded application development, the target processor is fixed, so you should enable this option to get higher performance. For example, if your embedded device is based on an Intel® Atom processor and your application will run only on this device, you can enable the option -xSSSE3_ATOM to get the best performance.
For Intel Silvermont-based Atom processors, you can enable the option -xATOM_SSE4.2 to get the optimizations for Silvermont cores.
When using the Intel C++ Compiler for the Quark SoC X1000, you can make use of the following two compiler switches:
- -mia32 tells the compiler to generate code for the IA-32 architecture.
- -falign-stack=assume-4-byte tells the compiler to assume the stack is aligned on 4-byte boundaries. The compiler can dynamically adjust the stack to 16-byte alignment if needed. This reduces the data size necessary to call routines.
You can also refer to the following link for more details on the supported <code> values of the -x option.
4. Use IPO optimization
Interprocedural Optimization (IPO) is an automatic, multi-step process that allows the compiler to analyze your code to determine where you can benefit from specific optimizations. With IPO options, you may see more optimizations for Intel microprocessors than for non-Intel microprocessors.
Before you enable IPO, you must replace your original archiver “ar” and linker “ld” with the Intel® tools “xiar” and “xild”. Please refer to the following article on how to port from GNU* compiler to Intel® C++ Compiler for more details.
IPO is a cross-file optimization. Normally you can enable it with the option “-ipo”. With this option, the compiler adds intermediate data to the generated object files. During the linking stage, the compiler is invoked again to perform multi-file optimization based on the intermediate data in the object files. This makes several whole-program optimizations possible, such as inlining functions across source files.
Enabling IPO increases both the memory requirement and the compilation time. For very large applications in particular, the compilation time may increase significantly.
Please refer to the Intel® C++ Compiler User and Reference Guide > Key Features > Interprocedural Optimization (IPO) for more details on IPO optimization.
5. Use PGO optimization
Profile-guided Optimization (PGO) improves application performance by reorganizing code layout to reduce instruction-cache problems, shrinking code size, and reducing branch mispredictions. PGO provides information to the compiler about areas of an application that are most frequently executed. By knowing these areas, the compiler is able to be more selective and specific in optimizing the application.
PGO consists of three phases or steps.
Step one is to instrument the program. In this phase, the compiler creates and links an instrumented program from your source code and special code from the compiler. In this step, you need to enable the option “-prof-gen” so that the compiler generates the instrumented binary.
Step two is to run the instrumented executable. Each time you execute the instrumented code, the instrumented program generates a dynamic information file, which is used in the final compilation.
Step three is the final compilation. When you compile a second time, the dynamic information files are merged into a summary file. Using the summary of the profile information in this file, the compiler attempts to optimize the execution of the most heavily traveled paths in the program. In this step, you need to enable the option “-prof-use” so that the compiler uses the profile data files produced in step two.
For embedded development, you may need to use the option “-prof-dir=<val>” to specify the path where the profile data is generated during execution. Please note that “<val>” is a folder name on your target system. You will find the profile data files in that folder after the program exits.
You need to copy the generated profile data files (.dyn files) from the target to a folder on your host machine, and then change the value of the -prof-dir option to that host folder.
Below are the basic steps to enable PGO for an embedded system.
1. Enable the option -prof-gen and the option -prof-dir=<val>, where <val> stands for the folder name on the target machine.
2. After compilation, run your program on the target machine. Find the .dyn files in the folder specified by -prof-dir and copy them to the host machine.
3. Change the option -prof-gen to -prof-use, change the value of -prof-dir to the folder on the host machine where the .dyn files are stored, and recompile the application.
The profile data is generated only when your program exits. On an embedded system, if your program runs indefinitely without an exit point, you need to do some additional work to make sure the profile data is generated. You can use one of the following approaches:
1. Manually add exit code in your program.
2. Add the PGO API _PGOPTI_Prof_Dump_All() to your source to dump the profile data.
3. Use environment variables to make regular dumps. For example, to dump everything into one file every 5000 microseconds, set the following environment variables:
export INTEL_PROF_DUMP_INTERVAL=5000
export INTEL_PROF_DUMP_CUMULATIVE=1
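The second approach, calling _PGOPTI_Prof_Dump_All() from the program itself, might look like the sketch below. It assumes that the PGO API is declared in the compiler's pgouser.h header and that the compiler defines the _PGO_INSTRUMENT macro when building with -prof-gen; please check both against the compiler documentation. The function names, the loop bound, and the dump frequency are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: _PGO_INSTRUMENT is defined only for instrumented builds and
// pgouser.h declares the PGO API. In all other builds the dump is a no-op.
#if defined(_PGO_INSTRUMENT)
#include <pgouser.h>
#define DUMP_PROFILE() _PGOPTI_Prof_Dump_All()
#else
#define DUMP_PROFILE() ((void)0)
#endif

// Hypothetical stand-in for one unit of the application's real work.
int process_one_request(int request) {
    return request * 2;
}

// Runs the service loop and requests a profile dump every `dump_every`
// iterations. Returns the number of dump requests issued. A real embedded
// service would loop forever; the bound keeps this sketch testable.
std::size_t run_service_loop(std::size_t iterations, std::size_t dump_every) {
    assert(dump_every > 0);
    std::size_t dumps = 0;
    volatile int sink = 0;  // keeps the work from being optimized away
    for (std::size_t i = 1; i <= iterations; ++i) {
        sink = process_one_request(static_cast<int>(i));
        if (i % dump_every == 0) {
            DUMP_PROFILE();
            ++dumps;
        }
    }
    (void)sink;
    return dumps;
}
```

Dumping too often adds overhead to the instrumented run, so the interval is best chosen to cover a representative amount of work.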
Please refer to the Intel® C++ Compiler User and Reference Guide > Key Features > Profile-Guided Optimization (PGO) for more details on PGO optimization.
6. Tune Auto-Vectorization
The automatic vectorizer (also called the auto-vectorizer) is a component of the Intel® compiler that automatically uses SIMD instructions from the Intel® Streaming SIMD Extensions (Intel® SSE, SSE2, SSE3 and SSE4), the Supplemental Streaming SIMD Extensions (SSSE3), and the Intel® Advanced Vector Extensions instruction sets. The vectorizer detects operations in the program that can be done in parallel and converts the sequential operations to parallel ones; for example, it converts a sequential instruction into a SIMD instruction that processes 2, 4, 8 or up to 16 elements in parallel, depending on the data type.
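As a rough illustration of which operations "can be done in parallel", the sketch below contrasts a loop with independent iterations against one with a loop-carried dependence. The function names are hypothetical, and whether a given compiler actually vectorizes the first loop depends on the options used and on its alias analysis.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Good candidate for auto-vectorization: unit-stride accesses, and no
// iteration depends on the result of another iteration.
void saxpy(float alpha, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const float* xp = x.data();
    float* yp = y.data();
    const std::size_t n = x.size();
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = alpha * xp[i] + yp[i];
    }
}

// Poor candidate as written: iteration i reads the value that iteration
// i - 1 has just stored, so the iterations cannot simply run side by side.
void prefix_sum(std::vector<float>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        v[i] += v[i - 1];
    }
}
```

Loops shaped like saxpy are the ones to aim for when restructuring hot code for the vectorizer.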
Compiler vectorization is key for most data processing applications that contain large data processing loops. The compiler also provides a detailed report that tells you which loops were vectorized, which were not, and the reasons why. There is an online white paper that can help you understand the compiler's auto-vectorization functionality and write vectorizable loops. Please refer to the following links for more details:
Putting the above steps together gives the following Intel® C++ Compiler options (Linux*), which can be easily applied to your application to get high performance.
-O2/O3 -no-prec-div -x<code> -ipo -prof-gen/-prof-use -prof-dir=<val>
The Intel® C++ Compiler provides many high-level and specific features for tuning your application's performance. The steps described in this article apply to most applications. There are also many other compiler options that may help with fine-grained performance optimization. Please refer to the Intel® C++ Compiler User and Reference Guide for more details.
You can also use Intel® VTune Amplifier to investigate the application's performance hotspots and architectural performance issues. Intel VTune Amplifier is a highly recommended tool for performance analysis and performance tuning.