Click here to Skip to main content
Click here to Skip to main content

Visual C++ 11 Beta Benchmark of Parallel Loops

, 29 Apr 2012 Ms-PL
Rate this:
Please Sign up or sign in to vote.
Showdown between OpenMP, Parallel Patterns Library and Auto-Parallelizer

Introduction

Visual Studio 11 Beta was released in February 2012 and was updated in April 2012. Among its many new features like ISO C++11 concurrency features and C++ AMP, the most interesting of them are Auto-Vectorizer and Auto-Parallelizer. Auto-Vectorizer is turned on by default and generates SIMD code for for-loop by executing multiple iterations at the same time, while Auto-Parallelizer (not turned on by default) runs the for-loop in parallel using multiple threads. This is what MSDN have to say about their difference.

There are some key differences between auto-vectorization and auto-parallelization. Auto-vectorization is always on and requires no user interaction, but auto-parallelization requires the programmer to decide which loops are to be parallelized. Also, vectorization improves the performance of loops on single-core CPUs that support SIMD instructions, but parallelization improves the performance of the loop on multiple CPUs and multi-core CPUs. The two features can work together so that a vectorized loop is then parallelized across multiple processors.

The reader may ask why I am not doing an article on Auto-Vectorizer: I found out that for the particular floating point computation I am doing, auto-vectorized code is only providing marginal improvement (~10% instead of 4x) over the VS2010 non-vectorized code: it could be seen in the generated assembly list that the code is indeed vectorized. And handcrafted vectorized code is only providing ~10% improvement over auto-vectorized code.

For this article, we will briefly look at OpenMP, Parallel Patterns Library and Auto Parallelizer methods available to parallelize for-loops. We do not look at features which are not related to for-loop. Without further ado, let's begin our article!

Bloom Effect

For our benchmark application, we make use bloom effect which is workload intensive enough for our purpose. Bloom effect is described on Wikipedia as "Bloom (sometimes referred to as light bloom or glow) is a computer graphics effect used in video games, demos and high dynamic range rendering (HDR) to reproduce an imaging artifact of real-world cameras." and is usually implemented using graphics shaders. The bloom efect is most beautiful on photos of dark scene with few lights. I simply port the Direct3D Bloom pixel shader from the excellent WPF Shader Article. The original source is presented below.

/// <class>BloomEffect</class>

/// <description>An effect that intensifies bright regions.</description>

//-----------------------------------------------------------------------------------------
// Shader constant register mappings (scalars - float, double, Point, Color, Point3D, etc.)
//-----------------------------------------------------------------------------------------

/// <summary>Intensity of the bloom image.</summary>
/// <minValue>0</minValue>
/// <maxValue>4</maxValue>
/// <defaultValue>1</defaultValue>
float BloomIntensity : register(C0);

/// <summary>Saturation of the bloom image.</summary>
/// <minValue>0</minValue>
/// <maxValue>10</maxValue>
/// <defaultValue>1</defaultValue>
float BloomSaturation : register(C1);

/// <summary>Intensity of the base image.</summary>
/// <minValue>0</minValue>
/// <maxValue>4</maxValue>
/// <defaultValue>1</defaultValue>
float BaseIntensity : register(C2);

/// <summary>Saturation of the base image.</summary>
/// <minValue>0</minValue>
/// <maxValue>10</maxValue>
/// <defaultValue>1</defaultValue>
float BaseSaturation : register(C3);

//--------------------------------------------------------------------------------------
// Sampler Inputs (Brushes, including ImplicitInput)
//--------------------------------------------------------------------------------------

sampler2D implicitInputSampler : register(S0);


//--------------------------------------------------------------------------------------
// Pixel Shader
//--------------------------------------------------------------------------------------
float3 AdjustSaturation(float3 color, float saturation)
{
    float grey = dot(color, float3(0.3, 0.59, 0.11));
    return lerp(grey, color.rgb, saturation);
}

float4 main(float2 uv : TEXCOORD) : COLOR
{
    float BloomThreshold = 0.25f;

    float4 color = tex2D(implicitInputSampler, uv);
    float3 base = color.rgb / color.a;
    float3 bloom = saturate((base - BloomThreshold) / (1 - BloomThreshold));
    
    // Adjust color saturation and intensity.
    bloom = AdjustSaturation(bloom, BloomSaturation) * BloomIntensity;
    base = AdjustSaturation(base, BaseSaturation) * BaseIntensity;
    
    // Darken down the base image in areas where there is a lot of bloom,
    // to prevent things looking excessively burned-out.
    base *= (1 - saturate(bloom));
    
    // Combine the two images.
    return float4((base + bloom) * color.a, color.a);
}

I am not going to show the C++ source here because C++ bloom source is more verbose due to Direct3D shader has better support for vector operation. float3 and float4 are examples of an vector float type. Note: Vector here does not refer to STL vector. In order to write the C++ version, I have to manually implement the following functions dot product (dot), saturate to clamp the values between 0.0 to 1.0 and linear interpolation (lerp). lerp function in Direct3D shader is implemented as x + s(y-x) where x and y is first and second vector, and s is the value to do interpolation.

color.rgb begs some explanation; the readers might have difficulty in understand the swizzling syntax in the form of color.rgb. color is a variable of float4 type which is a vector of 4 float elements. Basically, color.rgb tells the shader compiler to take the r, g and b elements to form a new float3 variable (vector of 3 float elements) in the same rgb order. Whereas color.a tells the compiler to take the a element of color into a scalar float variable.

Serial Loop

This is the reference source code for the for-loop we will be using to parallelize. One thing to note, is that when we execute the for-loop in parallel, the for-loop is not longer executed in sequence and we do not know which sequence it might be. Therefore, only loops with no loop dependencies can be executed correctly in parallel. To extract good performance, the number of iterations must be sufficiently large enough and/or the amount of work to done in each iteration must be intensive enough. However, the performance will never match the ratio of available CPU cores as there are always some overhead to create and schedule threads.

DWORD startTime = timeGetTime();

for(UINT row = 0; row < bitmapDataDest.Height; ++row)
{
    BloomEffect effect(m_fBloomIntensity, m_fBloomSaturation, m_fBaseIntensity, m_fBaseSaturation);
    for(UINT col = 0; col < bitmapDataDest.Width; ++col)
    {
        UINT index = row * stride + col;

        pixelsDest[index] = effect.ComputeBloomInt(pixelsSrc[index]);
    }
}

DWORD endTime = timeGetTime();

OpenMP Loop

OpenMP 2.0 has been in Visual C++ since version 8.0. It is supported by other C++ compliers as well. By default, OpenMP is not turned on in Visual Studio. To turn OpenMP on, we need to go the project properties, under the Language of C/C++ section and select Yes (/openmp) for Open MP Support. Please note: OpenMP does not have the mechanism to handle exception thrown in its loop.

DWORD startTime = timeGetTime();

#pragma omp parallel for
for(int row = 0; row < bitmapDataDest.Height; ++row)
{
    BloomEffect effect(m_fBloomIntensity, m_fBloomSaturation, m_fBaseIntensity, m_fBaseSaturation);
    for(UINT col = 0; col < bitmapDataDest.Width; ++col)
    {
        UINT index = row * stride + col;

        pixelsDest[index] = effect.ComputeBloomInt(pixelsSrc[index]);
    }
}

DWORD endTime = timeGetTime();

To parallelize the for-loop, we put a #pragma omp parallel for before the for-loop to indicate to OpenMP that we wish to parallelize that for-loop.

Parallel Patterns Library Loop

Parallel Patterns Library (PPL) is a task-based concurrency library introduced in Visual Studio 2010. Before using PPL, we have to include its header ppl.h and use its Concurrency namespace.

#include <ppl.h>

DWORD startTime = timeGetTime();

using namespace Concurrency;
parallel_for((UINT)0, bitmapDataDest.Height, [&](UINT row)
{
    BloomEffect effect(m_fBloomIntensity, m_fBloomSaturation, m_fBaseIntensity, m_fBaseSaturation);
    for(UINT col = 0; col < bitmapDataDest.Width; ++col)
    {
        UINT index = row * stride + col;

        pixelsDest[index] = effect.ComputeBloomInt(pixelsSrc[index]);
    }
});

DWORD endTime = timeGetTime();

The first parameter is the starting value of the index and the second parameter is one past the maximum value to iterate. The third parameter is a lambda to execute for each iteration. [&] indicates that we are capture the variables used in the lambda by reference. For more information on C++ lambdas, readers can refer to this link. Reader should note that the operation to invoke the lambda has some overheads. For the example below, PPL would fare several orders in magnitude behind OpenMP and Auto Parallelizer and serial code because the lambda invocation is more expensive than the add operation.

using namespace Concurrency;
parallel_for(0, 1000000, [&](UINT i)
{
    c[i] = a[i] + b[i];
});

Auto Parallelizer Loop

Auto Parallelizer is designed to work in tandem with the Auto Vectorizer, meaning Auto Parallelizer can parallelize loops which are auto-vectorized. For loops which has been auto-vectorized, user can expect a 4X increase of performance because 4 elements are executed using SSE in a single iteration. Combined with Auto Parallelizer which say running on a 4 core machine, the performance could theoretically be 16X!

Auto Parallelizer in Visual Studio 11 Beta is not turned on by default. To turn Auto Parallelizer on, we need to go the project properties, under the Code Generation of C/C++ section and select Yes (/Qpar) for the Enable Parallel Code Generation. To enable Auto Parallelizer for the for loop, we add a pragma in the form of #pragma loop(hint_parallel(N)) where N is a literal constant. If user try to replace it with a integer constant variable based on the number of CPU cores detected during the runtime, the compiler will complain that hint_parallel has to be literal constant.

DWORD startTime = timeGetTime();

#pragma loop(hint_parallel(8))
for(UINT row = 0; row < bitmapDataDest.Height; ++row)
{
    BloomEffect effect(m_fBloomIntensity, m_fBloomSaturation, m_fBaseIntensity, m_fBaseSaturation);
    for(UINT col = 0; col < bitmapDataDest.Width; ++col)
    {
        UINT index = row * stride + col;

        pixelsDest[index] = effect.ComputeBloomInt(pixelsSrc[index]);
    }
}

DWORD endTime = timeGetTime();

Auto Parallelizer does not seem to work in the benchmark application. This is why I attached a console project to show Auto Parallelizer works. I have reported the bug at Microsoft Connect. The console application makes use of an array while the MFC application make use of an aliased pointer to an array. My guess is it did not work probably due to an aliased pointer to array is used, instead of an array.

Benchmark Results

The benchmask is done on 8 core hyperthreading Intel i7 870 running at 2.93GHz. Since i7 870 is a CPU with hyperthreading support, it has effectively 4 real cores (We divide the number of cores by 2 if it is hyperthreading CPU). Below is the benchmark results for running release mode with no other intensive programs running on the system.

Serial : 72ms
OpenMP : 16ms
PPL    : 12ms
AutoP  : 62ms

The reader may want to run the benchmark with his/her usual programs running to see how his/her program would fare under the real-life condition in which his/her users will be using their program.

Conclusion

We have looked at the performance of OpenMP, Parallel Patterns Library and Auto-Parallelizer. Sadly, Auto Parallelizer is not working in the benchmark application. There is no reason to choose one over the other though we have to be careful about the PPL's lambda invocation overhead in comparison to the workload. On some PCs, PPL has better performance than OpenMP. For code portability, user may stick to OpenMP because it is implemented by other C++ compilers as well while PPL and Auto-Parallelizer are only found in Microsoft Visual Studio. To build the sample source code, users need to download and install Visual Studio 11 Beta. Any feedback on how I could improve the benchmark application or the article is most welcome.

Thank you for reading!

History

  • 2012-04-30 : Added explanation on HLSL swizzling operator
  • 2012-04-25 : First Release

License

This article, along with any associated source code and files, is licensed under The Microsoft Public License (Ms-PL)

Share

About the Author

Shao Voon Wong
Software Developer
Singapore Singapore
No Biography provided

Comments and Discussions

 
QuestionVery good Pinmemberfkassaie27-Oct-13 20:43 
GeneralMy vote of 5 Pinmemberhatemtaleb28-Aug-13 3:33 
QuestionControl benchmark PinmemberDominic Amann30-Apr-12 9:36 
AnswerRe: Control benchmark PinmemberWong Shao Voon30-Apr-12 17:45 
GeneralMy vote of 5 PinmemberIlka Guigova30-Apr-12 8:51 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Mobile
Web02 | 2.8.141022.2 | Last Updated 29 Apr 2012
Article Copyright 2012 by Shao Voon Wong
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid