Posted 20 Feb 2018
Licensed CPOL

Get Started Turbo-Charging Your Applications with Intel® Parallel Studio XE

Intel Parallel Studio XE isn’t just a tool to help you profile your applications, it’s a suite that enhances Microsoft Visual Studio and gives you deep insight into the performance of your applications.

Editorial Note

This article is in the Product Showcase section for our sponsors at CodeProject. These articles are intended to provide you with information on products and services that we consider useful and of value to developers.

Building a great application is a challenge, maintaining that application is an architectural challenge, and tuning your application for performance is a science. For many of us, tuning our applications can feel like performing surgery in the dark; you don't know if what you're changing will significantly improve things, and you won't find out until you stitch things up and run the application again. Thankfully, there are tools available to help us analyze our code and give us direction to improve it. The coolest of these tools that I have had a chance to encounter is Intel Parallel Studio XE.

Introducing Intel® Parallel Studio XE

Intel Parallel Studio XE isn't just a tool to help you profile your applications, it's a suite that enhances Microsoft Visual Studio and gives you deep insight into the performance of your applications. Not only do you get the lightning-quick Intel® C++ Compiler, but you also get Intel's Fortran compiler and Python distribution. Parallel Studio XE comes in three different SKUs:

Composer Edition – Includes the compilers, performance libraries to optimize image processing and data compression, the Intel® Math Kernel Library that's tuned for Intel processors, and the data analytics acceleration library for working with large data sets.

Professional Edition – Includes everything in the Composer Edition and:

Intel® VTune™ Amplifier for profiling your C++, Fortran, Python, Go, and Java code.

Intel® Advisor for optimizing and vectorizing your code

Intel® Inspector for analyzing code for memory and parallel execution errors

Cluster Edition – Includes everything in the Professional Edition and:

Intel® MPI Library to help applications perform better on Intel-based clusters with low latency and sustained scalability

Intel® Trace Analyzer and Collector to profile and analyze MPI communications while checking for errors and tuning for performance.

Intel® Cluster Checker to provide a diagnostic expert system that ensures your cluster is running reliably

I installed the Cluster Edition so that I could check out all the cool features from Intel. Installing this on top of Visual Studio 2017 added many cool features that I wasn't expecting to find.

Figure 1 - Fortran Project Templates in Visual Studio 2017

I wasn't expecting, but was pleasantly surprised to find, new Fortran templates for all the standard project types you would expect a desktop programming language to support: console, Windows Forms, even templates for building DLLs that could be referenced by other projects.

Additionally, I started seeing the many Intel compiler options appearing throughout my Project configuration and Project menus, allowing me to switch from the Microsoft compiler to the Intel compiler.

Figure 2 - Switching between the Intel compiler and the Microsoft Visual C++ compiler

OK... that covers the basics, but I want to explore the Intel Advisor tool in more depth.

The Intel Advisor – Testing Against Real Code

I'm always curious about tools that claim to be able to show you not just raw numbers, but also give you actionable advice to make your application better. This level of interaction from a tool requires that it be constructed as more of an expert system, giving advice based on a series of known circumstances.

I downloaded the Intel Mandelbrot demo, which generates the familiar Mandelbrot fractal image using four different techniques:

  • Standard serial processing – nothing fancy, just a single-threaded loop
  • SIMD processing – running a single instruction on multiple data elements in parallel
  • Threaded processing – running loop iterations in parallel across threads
  • Both SIMD and threaded processing

I set the size of the output image to 1024 x 2048 and compiled using the Intel Compiler on my Yoga laptop with an Intel® Core™ i7 processor and 16GB of RAM. The basic scalar function has the following code:

// Description:
// Determines how deeply points in the complex plane, spaced on a uniform grid, remain in the Mandelbrot set.
// The uniform grid is specified by the rectangle (x1, y1) - (x0, y0).
// Mandelbrot set is determined by remaining bounded after iteration of z_n+1 = z_n^2 + c, up to max_depth.
// Everything is done in a linear, scalar fashion
// [in]: x0, y0, x1, y1, width, height, max_depth
// [out]: output (caller must deallocate)
unsigned char* serial_mandelbrot(double x0, double y0, double x1, double y1,
  int width, int height, int max_depth) {
  double xstep = (x1 - x0) / width;
  double ystep = (y1 - y0) / height;
  unsigned char* output = static_cast<unsigned char*>(_mm_malloc(width * height * sizeof(unsigned char), 64));
  // Traverse the sample space in equally spaced steps with width * height samples
  for (int j = 0; j < height; ++j) {
    for (int i = 0; i < width; ++i) {
      double z_real = x0 + i*xstep;
      double z_imaginary = y0 + j*ystep;
      double c_real = z_real;
      double c_imaginary = z_imaginary;
      // depth should be an int, but the vectorizer will not vectorize, complaining about mixed data types
      // switching it to double is worth the small cost in performance to let the vectorizer work
      double depth = 0;
      // Figures out how many recurrences are required before divergence, up to max_depth
      while (depth < max_depth) {
        if (z_real * z_real + z_imaginary * z_imaginary > 4.0) {
          break; // Escape from a circle of radius 2
        }
        double temp_real = z_real*z_real - z_imaginary*z_imaginary;
        double temp_imaginary = 2.0*z_real*z_imaginary;
        z_real = c_real + temp_real;
        z_imaginary = c_imaginary + temp_imaginary;
        ++depth;
      }
      output[j*width + i] = static_cast<unsigned char>(static_cast<double>(depth) / max_depth * 255);
    }
  }
  return output;
}

This generates an image and a wrapping function writes it to disk. The results are impressive:

Figure 3 - Output of the Mandelbrot application
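The sample's disk-writing wrapper isn't shown in the article; a minimal sketch of such a writer, assuming the 8-bit grayscale buffer produced by serial_mandelbrot above and the simple binary PGM image format (the actual sample may write a different format), could look like this:

```cpp
#include <cstdio>

// Writes a width x height 8-bit grayscale buffer to a binary PGM (P5) file.
// Assumes 'data' holds height*width bytes in row-major order, as produced by
// serial_mandelbrot above. Returns true on success.
bool write_pgm(const char* path, const unsigned char* data, int width, int height) {
    FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    // PGM header: magic number, dimensions, maximum gray value
    std::fprintf(f, "P5\n%d %d\n255\n", width, height);
    std::size_t count = (std::size_t)width * height;
    std::size_t written = std::fwrite(data, 1, count, f);
    std::fclose(f);
    return written == count;
}
```

PGM is convenient here because it has no dependencies and most image viewers can open it directly.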

The SIMD implementation is exactly the same, except for the addition of a pragma on the inner for loop. The threaded implementation replaces the outer for loop with a call to a parallel for.

The last implementation that uses both threaded and simd processing is the same parallel for loop, but this time decorated with the pragma.
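The article doesn't reproduce the other three variants; as a rough sketch of the structure they describe, assuming OpenMP-style directives stand in for the Intel sample's actual pragma and parallel-for construct, the combined variant might look like this:

```cpp
// Shared per-point kernel: iteration depth before divergence, up to max_depth.
static double mandel_depth(double c_real, double c_imaginary, int max_depth) {
    double z_real = c_real, z_imaginary = c_imaginary;
    double depth = 0;
    while (depth < max_depth) {
        if (z_real * z_real + z_imaginary * z_imaginary > 4.0) break;
        double temp_real = z_real * z_real - z_imaginary * z_imaginary;
        double temp_imaginary = 2.0 * z_real * z_imaginary;
        z_real = c_real + temp_real;
        z_imaginary = c_imaginary + temp_imaginary;
        ++depth;
    }
    return depth;
}

// Threaded + SIMD variant: outer loop split across threads, inner loop
// vectorized. The #pragma omp directives here are illustrative only; the
// Intel sample uses its own pragma and parallel-for mechanism.
void omp_mandelbrot(double x0, double y0, double xstep, double ystep,
                    int width, int height, int max_depth, unsigned char* output) {
    #pragma omp parallel for          // threaded: rows in parallel
    for (int j = 0; j < height; ++j) {
        #pragma omp simd              // SIMD: vectorize across columns
        for (int i = 0; i < width; ++i) {
            double d = mandel_depth(x0 + i * xstep, y0 + j * ystep, max_depth);
            output[j * width + i] = (unsigned char)(d / max_depth * 255);
        }
    }
}
```

Dropping either pragma gives the threaded-only or SIMD-only variant, which is exactly how the four techniques differ.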

Running this application with each of the four techniques, five times each, and averaging the results produces the following times:

  • Serial: 240ms
  • Simd: 245ms
  • Threaded: 76ms
  • Threaded + simd: 84ms
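The article doesn't show how the timings were collected; a simple averaging harness, assuming std::chrono wall-clock timing around the serial_mandelbrot function above (the actual measurement code isn't shown), might look like:

```cpp
#include <chrono>
#include <functional>

// Runs 'work' the given number of times and returns the mean
// wall-clock duration in milliseconds.
double average_ms(const std::function<void()>& work, int runs) {
    using clock = std::chrono::steady_clock;
    double total_ms = 0.0;
    for (int r = 0; r < runs; ++r) {
        auto start = clock::now();
        work();
        auto end = clock::now();
        total_ms += std::chrono::duration<double, std::milli>(end - start).count();
    }
    return total_ms / runs;
}
```

steady_clock is the right choice here: unlike system_clock it never jumps backwards, so intervals stay meaningful.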

I'm feeling pretty good about the performance of my laptop: as the parallelism of the application goes up and more optimized parallel processing algorithms are used, the Mandelbrot image is generated faster. The second number, however, doesn't look like I would expect: the SIMD version is reporting that it runs slower than the scalar approach. Let's run the Intel Advisor against this code and see if it can help us figure out why the vectorized execution is slower than the scalar execution. The summary report from the Advisor tool showed me that my vectorized code wasn't seeing much benefit:

In this graph, the yellow filled section shows the performance of the vectorized code, and the black triangle inside the bar shows where scalar performance would be expected to fall. The Advisor has identified that something is wrong here, and its detailed reports have more information.

The initial roofline chart showed me a bit of what I expected:

Each dot plotted is a loop, and we should focus on improving the big, slow loops that sit lowest on the chart. The red and yellow bubbles in the graph are the serial and SIMD executions of the Mandelbrot algorithm. I was expecting the serial execution to be low on the graph, but I was not expecting the recommendation to use the AVX2 instruction set. That could explain why the SIMD execution was slower than the scalar one: the current instruction set was limiting it to a small vector length that was easily crippled by inefficiency. I added this switch to the compiler configuration for my application, rebuilt, and saw these times reported by the Mandelbrot application:

  • Serial: 216ms
  • Simd: 107ms
  • Threaded: 68ms
  • Threaded + simd: 38ms

That's a lot better! I immediately see a 55% performance improvement on my best algorithm, dropping from 84ms to 38ms, and an improvement of more than 80% over the original scalar execution. Additionally, my SIMD implementation now reports roughly 50% faster than the scalar model, as expected. I re-ran the Advisor against the compiled executable, and my new recommendations report looks like the following:

Here is where the sample purposefully includes code that isn't optimized: even with the AVX2 instruction set in place, the Advisor still can't optimize that one red bubble. That bubble is the result of the serial execution of the Mandelbrot algorithm in the scalar function. The other parallel executions are now running much faster, and the Advisor recommends that we eliminate that serial execution. There is also a complete report of the recommendations that the Intel Advisor can make when you click the C++ link next to "All Advisor-detectable issues".

We can dig in a little further and look at the hot spots in our code by selecting the 'Top Down' tab in the Advisor and reviewing its contents:

Interestingly, it's flagging an issue in the scalar processing where a while loop was declared, suggesting that we explicitly calculate the number of times through the loop to optimize processing further. This happens to be one of the optimizations that this sample code deliberately omits, in order to show the difference between the various processing models. You can also see the Advisor recommending that the scalar outer loops be decorated with a SIMD directive, which would immediately yield roughly a 50% improvement over scalar processing.
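Advisor's suggestion about the while loop amounts to replacing data-dependent control flow with a countable loop, which vectorizers handle far better. A hedged sketch of what such a rewrite might look like (this is my illustration, not the sample's actual optimization):

```cpp
// Countable-trip-count variant of the escape-time loop: the for loop always
// runs a fixed max_depth iterations, and a flag records the iteration at
// which the point escaped the radius-2 circle. A known trip count lets the
// vectorizer process several points per instruction without divergent exits.
int counted_depth(double c_real, double c_imaginary, int max_depth) {
    double z_real = c_real, z_imaginary = c_imaginary;
    int depth = max_depth;   // default: never escaped within max_depth
    bool escaped = false;
    for (int k = 0; k < max_depth; ++k) {
        if (!escaped && z_real * z_real + z_imaginary * z_imaginary > 4.0) {
            depth = k;       // first iteration at which the point escaped
            escaped = true;
        }
        if (!escaped) {
            double temp_real = z_real * z_real - z_imaginary * z_imaginary;
            double temp_imaginary = 2.0 * z_real * z_imaginary;
            z_real = c_real + temp_real;
            z_imaginary = c_imaginary + temp_imaginary;
        }
    }
    return depth;
}
```

The trade-off is that every point now pays for max_depth iterations, which is why the sample leaves this optimization out for comparison purposes.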

Additionally, scrolling down in the same report, we can see that the threaded execution reports a Total Time % of 3.8%, almost 8x faster than the serial version.


This code sample was constructed specifically to show the performance capabilities of different parallel execution models. The Intel Advisor that comes with Parallel Studio can identify the slower processing models and recommend updates. In fact, not only did it recommend choosing another model, it even recommended using a better compiler instruction set to get the most out of my Intel® Core™ i7 processor. It's smarts like these in my tools that I look for, so that I don't have to spend time being a computer scientist figuring out how to optimize for my processor; I can let the experts from Intel do that and take advantage of their expertise. I can then focus on writing great applications and trust in my partner Intel to help make my application great.


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Jeffrey T. Fritz
Program Manager
United States United States
Jeffrey is a software developer coach, architect, and speaker in the Microsoft .NET community. He currently works as a program manager for the Microsoft .NET Developer Outreach group. He has delivered training videos on Pluralsight, WintellectNow, and YouTube. Jeffrey makes regular appearances delivering keynotes, workshops, and breakout sessions at conferences such as TechEd, Ignite, DevIntersection, CodeStock, FalafelCon, and VSLive, as well as user group meetings, in an effort to grow the next generation of software developers.

Article Copyright 2018 by Jeffrey T. Fritz