Intel® Advisor Review

theonemule

Feb 27, 2017

CPOL

Intel has long been a leader in processor technology and has pushed processors close to the physical limits of how fast they can go. To keep up with the ever-increasing demands of applications, Intel has invested billions in research and development to squeeze as much compute as possible out of every CPU cycle, and it has done this largely through parallelization. Historically, CPUs performed one operation per thread per CPU cycle. It didn’t take long before engineers figured out how to perform tasks in parallel using multiple threads. Multithreading found its way into hardware first through multiple CPUs, then through hyperthreaded CPUs, and eventually through multiple hyperthreaded cores on a single CPU. Thanks to hyperthreading, the number of threads on an Intel CPU is usually twice the number of cores. For instance, the Intel® Core™ i7-6700HQ is a quad-core processor, so it has eight threads.

Alongside multithreading, Intel also added vectorization. Vectorization is another form of parallelism, in which a single instruction operates on multiple data items: Single Instruction, Multiple Data (SIMD). Intel first shipped it in 1997 with the MMX instructions and then expanded it with Streaming SIMD Extensions (SSE) in 1999, Advanced Vector Extensions (AVX) in 2011, AVX2 in 2013 and AVX-512 in 2016. Each advance adds new instructions and wider registers, but the principle is basically the same. Intel claims that the combined impact of vector parallelism and thread parallelism can increase performance by as much as 187x on some algorithms. This is an order of magnitude over threading or vectorization alone.

To take full advantage of parallelization features, developers have to change how they code. But many optimizations can be found with Intel’s parallelization tool, Intel® Advisor. By design, Intel Advisor analyzes applications as they run, looking for areas that can benefit from both threading and vectorization. The focus here is on vectorization.

Vectorization is typically performed on loops operating on arrays of data. It works by applying one operation to several elements of that data in a single instruction. For instance, if a loop iterates over an array of integers and adds some value to each element, the scalar loop handles one element per iteration: it performs the addition on the element at the current index and then increments the index:

for(i = 0; i < 1000; i++){
	arrayA[i] = arrayA[i] + 2;
}

As a first step toward vectorization, the loop handles several elements per iteration, so the index advances in larger steps. The loop would be “unrolled” to look like this:

for(i = 0; i < 1000; i = i + 4){
	arrayA[i] = arrayA[i] + 2;
	arrayA[i + 1] = arrayA[i + 1] + 2;
	arrayA[i + 2] = arrayA[i + 2] + 2;
	arrayA[i + 3] = arrayA[i + 3] + 2;
}

Notice that it’s incrementing the array index by four now. The vectorized form would look something like this:

for(i = 0; i < 1000; i = i + 4){
	/* hypothetical helper: adds 2 to arrayA[i] through arrayA[i + 3] at once */
	addNumberToFourElementsInASingleCycle(&arrayA[i], 2);
}

This is more pseudocode than anything, but it performs the same task as the scalar version and the unrolled version in roughly a quarter of the cycles. For vectorization to work, there can’t be any data dependencies between iterations that would make the parallel operations produce unexpected results. Understanding this is important, because not every loop in every program will benefit from vectorization, or is even able to use it. So while the tool does have some rich features, its value, at least for vectorization, is somewhat limited to specific kinds of applications.
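To make the point about data dependencies concrete, here is a minimal sketch of two loops that look similar but behave very differently under vectorization. The function names add_constant and running_sum are made up for this illustration and do not come from Intel Advisor or its samples.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: element i depends only on its own old value,
   so several elements can safely be updated at the same time. */
void add_constant(int *data, size_t n, int value) {
    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        data[i] = data[i] + value;
    }
}

/* Loop-carried dependency: element i needs the value just written to
   element i - 1, so the iterations cannot simply run side by side. */
void running_sum(int *data, size_t n) {
    assert(data != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        data[i] = data[i] + data[i - 1];
    }
}
```

Calling running_sum on {1, 2, 3, 4} yields {1, 3, 6, 10}. Each result builds on the previous one, which is exactly the kind of dependency that keeps a compiler from vectorizing the loop as written.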

There are three ways to use the tool: through the Intel-provided GUI, through the Visual Studio integration that Intel installs, or through the command line (which is quite popular on Linux). It also lets the user conveniently switch between the Intel C++ compiler, which is installed with the package, and the Microsoft Visual C++ compiler. The output from either compiler can be analyzed by the tools in Intel Advisor, but the tool can show more recommendations for code built with the Intel compiler than for code built with Microsoft Visual C++ or GCC.

The results of vectorization are easy to test by comparing vector-optimized loops with unoptimized loops, and the built-in analysis tools in Intel Advisor supply this comparison. Implementing and testing the recommendations made by the vectorization analysis was fairly straightforward. The output suggested parallelization optimizations both in the compiler settings and in the code.
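Before comparing performance, it is worth confirming that an optimized loop still computes the same values as the original. The following is a rough sketch of such a check and is independent of Intel Advisor. The helper names and the 1024-element limit are assumptions made for the example. Unlike the unrolled loop shown earlier, this version adds a remainder loop so that it also works when the length is not a multiple of four.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: one element per iteration. */
void add_scalar(int *data, size_t n, int value) {
    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        data[i] += value;
    }
}

/* Four elements per iteration, then a remainder loop for the
   last n % 4 elements. */
void add_unrolled(int *data, size_t n, int value) {
    assert(data != NULL || n == 0);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        data[i]     += value;
        data[i + 1] += value;
        data[i + 2] += value;
        data[i + 3] += value;
    }
    for (; i < n; i++) {
        data[i] += value;
    }
}

/* Returns 1 when both versions agree on an n-element test array
   (n is limited to 1024 in this sketch), 0 otherwise. */
int results_match(size_t n, int value) {
    int a[1024], b[1024];
    assert(n <= 1024);
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] = (int)i;
    }
    add_scalar(a, n, value);
    add_unrolled(b, n, value);
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i]) {
            return 0;
        }
    }
    return 1;
}
```

A call such as results_match(1003, 2) exercises both the four-wide body and the remainder loop.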

The first pass with the analysis tool profiled the application as it ran, without any of the parallelization recommendations turned on. The application needs to run long enough for the profiler to capture telemetry from it. After the profiler finishes, it produces a chart showing the telemetry from the loops inside the application.

The second pass implemented the recommendations. For this particular example, that meant turning on the compiler’s vectorization option and adding two #pragma directives. The gains showed up as improved efficiency of the optimized code in the analysis results.
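The review does not say which two pragmas were added, so the following is only a guess at what this kind of change can look like. It uses the standard OpenMP directive #pragma omp simd together with the C99 restrict qualifier. The function name add_arrays is invented for the example. With GCC or Clang the directive takes effect when the code is built with -fopenmp-simd. Without such an option the pragma is ignored and the loop still produces the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Adds src[i] to dst[i] for every element.
   `restrict` promises the compiler that the two arrays do not overlap,
   and the pragma asks it to vectorize the loop that follows.
   Assumption: this is one possible hint, not necessarily the one
   Intel Advisor recommended in the review. */
void add_arrays(float *restrict dst, const float *restrict src, size_t n) {
    assert((dst != NULL && src != NULL) || n == 0);
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        dst[i] += src[i];
    }
}
```

The hint is only valid because the iterations really are independent. Putting the same pragma on a loop with a dependency between iterations can silently produce wrong results.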

Interestingly enough, after running the application a few times, there didn’t seem to be any gains in actual computation time. This may be because the application is rather small and executes quickly.
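When a kernel finishes too quickly to measure, one common workaround is to repeat it many times and time the whole batch. Here is a rough sketch using the standard clock() function. The function name time_add_constant and its parameters are assumptions for this illustration, not part of the application reviewed here.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Runs the add-a-constant loop `repeats` times over the same array and
   returns the CPU time consumed, in seconds. */
double time_add_constant(int *data, size_t n, int value, int repeats) {
    assert(data != NULL && repeats > 0);
    clock_t start = clock();
    for (int r = 0; r < repeats; r++) {
        for (size_t i = 0; i < n; i++) {
            data[i] += value;
        }
    }
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Treat the numbers from a harness like this as rough. clock() has limited resolution, and an optimizing compiler is free to simplify a loop this simple, so a profiler remains the better source of per-loop timings.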

Vectorization works well when dealing with array buffers, which is why Intel SSE is marketed as multimedia acceleration. Multimedia code usually reads a byte stream and operates on it using indexes that do not depend on the data. The applications that benefit most from these optimizations are those that work with predictable data sets, where the operations and indexes on those data sets are independent of the data itself. A few examples include media streaming applications, applications that work on bitmaps such as graphics manipulation tools, and applications that use rendering engines, such as games. Not coincidentally, all of these kinds of applications are also compute intensive. Line-of-business applications, however, probably won’t see much benefit from the vector analysis tools. The threading analysis, on the other hand, would certainly be worth its weight in gold, because line-of-business applications are usually multiuser or multi-tenant applications that need multithreading.
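As a small illustration of the bitmap case, here is a sketch of a brightness adjustment over a buffer of 8-bit samples. The function name brighten is hypothetical. The index simply walks the buffer from start to end regardless of what the pixels contain, which is the data-independent access pattern described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Brightens every 8-bit sample by `amount`, clamping at 255.
   Each element is read and written exactly once, and the index never
   depends on the pixel values, so the iterations are independent. */
void brighten(uint8_t *pixels, size_t n, uint8_t amount) {
    assert(pixels != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)pixels[i] + amount;
        pixels[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

Brightening the samples {0, 100, 250} by 10 gives {10, 110, 255}. The last value saturates instead of wrapping around.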

Intel Advisor 2017 is part of Intel® Parallel Studio XE, a bundle that contains optimizing compilers, performance libraries and other analysis tools such as the performance profiler Intel® VTune™ Amplifier. Parallel Studio retails for $1,599. The Intel Parallel Studio XE software package is a fairly hefty download at over 3 GB, but it comes with two straightforward options: an all-inclusive download or a piecemeal download that installs only the wanted features.

This evaluation used Microsoft Visual Studio 2015 Community Edition on Windows 10 64-bit with Intel Advisor 2017. Generally speaking, the application was stable with no crashes or hiccups, but it did feel sluggish on some pretty advanced hardware. Likewise, it’s obvious Intel didn’t spend a lot of time trying to make the UX of the application modern, as it feels more like an app written for a Windows 98 PC. Although it could use some improvements in the UI/UX realm, the focus of the application is not on a slick UI/UX but on improving application performance, and most of the work has gone into what isn’t seen.

The real benefit, though, is for developers writing complex, CPU-intensive applications. Spending $1,599 on the app is money well spent if it means that you don’t have to manually comb through thousands of lines of code to look for optimizations. Intel Advisor can do the work automatically, and the results really speak for themselves.