Click here to Skip to main content
15,867,835 members
Articles / Desktop Programming / MFC

Fast SIMD Prototyping

Rate me:
Please Sign up or sign in to vote.
4.97/5 (51 votes)
5 Mar 2012Ms-PL9 min read 73.9K   1.4K   99  
Prototype SIMD vectorized code effortlessly
This is an old version of the currently published article.

Introduction

Many a times, when we convert the computation code to SIMD using intrinsic, the SIMD code take more time and effort to write and it is unreadable and harder to maintain, compared to the scalar code. We do have to maintain a version of scalar code so that the application can fallback on when executed on a system without the SIMD support. More often than not, we discovered, to our dismay, the vectorized code is lower in performance. It is one thing to use well-optimized SIMD library and another to handcode our own library using SIMD. In this article, we will introduce some SIMD vector class and their overloaded operators (+,-,*,/) that shipped with Visual C++ (the class are actually from Intel C++ Libraries) to ease our SIMD writing and lessen the effort to write readable SIMD code. Would you rather read/write z = x + y; or z = _mm_add_ps(x, y);? The answer is obvious. In this way, we can throw away our hand-written vectorized code without much heartpain in the event that they are found to be slower in execution.

Simple Example

Let me illustrate with an simple example of using SSE (not SSE2) float vector class, F32vec4, for integer to do addition of 4 numbers. F32vec4 is defined in the fvec.h header.

C++
F32vec4 vec1(1.0f, 2.0f, 3.0f, 4.0f);
F32vec4 vec2(5.0f, 6.0f, 7.0f, 8.0f);

F32vec4 result = vec1 + vec2;

Without using the F32vec4 class and using pure SSE intrinsic, the code would be as follows. As you can see, using intrinsic, we have to either remember them or look them up in the documentation whenever we need to use them. For those who are looking into delaying the onset of dementia or Alzheimer, can try memorizing those hard-to-remember intrinsic names.

C++
__m128 vec1 = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
__m128 vec2 = _mm_set_ps(5.0f, 6.0f, 7.0f, 8.0f);

__m128 result = _mm_add_ps(vec1, vec2);

Below is the table which lists the vector classes and their associated header files from Optimizing software in C++ ebook.

Vector class Size of each element, bits Number of elements Type of elements Total size of vector, bits Header file
Is8vec8 8 8 char 64 ivec.h
Iu8vec8 8 8 unsigned char 64 ivec.h
Is16vec4 16 4 short int 64 ivec.h
Iu16vec4 16 4 unsigned short int 64 ivec.h
Is32vec2 32 2 int 64 ivec.h
Iu32vec2 32 2 unsigned int 64 ivec.h
I64vec1 64 1 __int64 64 ivec.h
Is8vec16 8 16 char 128 dvec.h
Iu8vec16 8 16 unsigned char 128 dvec.h
Is16vec8 16 8 short int 128 dvec.h
Iu16vec8 16 8 unsigned short int 128 dvec.h
Is32vec4 32 4 int 128 dvec.h
Iu32vec4 32 4 unsigned int 128 dvec.h
I64vec2 64 2 __int64 128 dvec.h
F32vec4 32 4 float 128 fvec.h
F64vec2 64 2 double 128 dvec.h
F32vec8 32 8 float 256 dvec.h
F64vec4 64 4 double 256 dvec.h

Optimizing software in C++: It is not recommended to use the 64-bit vectors in ivec.h because these are incompatible with floating point code. If you do use the 64-bit vectors then you have to execute _mm_empty() after the 64-bit vector operations and before any floating point code. The 128-bit vectors do not have this problem. The 256-bit vectors require that the AVX instruction set is supported by the microprocessor and the operating system.

Most of the integer vector class do not support vector multiplication operations and all integer vector class do not support vector division operations, while float and double vector class do support vector addition, subtraction, multiplication and division. Here is an example to implement your own integer (32bit) division operator using the scalar division. And of course, using scalar operation will be slower.

C++
static inline Is32vec4 operator / (Is32vec4 const & a, Is32vec4 const & b) 
{
    Is32vec4 ans;
    ans[0] = a[0] / b[0];
    ans[1] = a[1] / b[1];
    ans[2] = a[2] / b[2];
    ans[3] = a[3] / b[3];
    return ans;
}

Helper Array Class

I have written 3 simple array class (for integer, float and double, respectively) to ease the use of array of SIMD vectors. Note the vector term used extensively in the article, actually refers to the 128 bit vector of simple types like integer, float and so on; it is not referring to the STL vector class. The most notable function of the array class is setUnusedValue. When there is a need to do division and the number of elements in the array is not multiple of 4 (32 bit POD), clearly we need to call setUnusedValue to set the last unused elements to 1 to avoid the division by zero error. Before using any SIMD instructions, you need to check for its support but every processor today should have SSE2 support.

C++
class Int128Array
{
public:
    //! Constructor
    Int128Array(int size32)
        :
        arr(NULL),
        size128(0),
        size32(0),
        remainder(0)
    {
        // divide by 4, to get the size in 128bit.
        size128 = (size32) >> 2;
        // calculate the remainder
        remainder = size32%4; 

#ifdef _DEBUG
        assert(remainder<4);
#endif // _DEBUG

        // if there is remainder, 
        // add 1 to the 128bit m_arraysize
        if(remainder!=0) 
            ++size128;
        
        this->size32 = size32;

        size_t fullsize = (size128) * 16;

        arr = (Is32vec4 *)(_aligned_malloc((size_t)(fullsize), 16));

        Is32vec4 initialised(1,1,1,1);

        for(size_t i=0; i<size128; ++i)
        {
             arr[i] = initialised;
        }
    }

    //! Constructor
    Int128Array(int size32, int initialize)
        :
        arr(NULL),
        size128(0),
        size32(0),
        remainder(0)
    {
        // divide by 4, to get the size in 128bit.
        size128 = (size32) >> 2;
        // calculate the remainder
        remainder = size32%4; 

#ifdef _DEBUG
        assert(remainder<4);
#endif // _DEBUG

        // if there is remainder, 
        // add 1 to the 128bit m_arraysize
        if(remainder!=0) 
            ++size128;

        this->size32 = size32;

        size_t fullsize = (size128) * 16;

        arr = (Is32vec4 *)(_aligned_malloc((size_t)(fullsize), 16));

        Is32vec4 initialised(initialize,initialize,initialize,initialize);

        for(size_t i=0; i<size128; ++i)
        {
            arr[i] = initialised;
        }
    }

    //! Destructor
    ~Int128Array()
    {
        if(arr!=NULL)
        {
            _aligned_free(arr);
            arr=NULL;
        }
    }

    //! overloaded operator to return 32 bit integer
    int& operator()(size_t i)
    {
#ifdef _DEBUG
        assert(i<size32);
#endif // _DEBUG

        int a = i >> 2; // divide by 4
        int b = a << 2; // multiply by 4 again

        int rem = i - b; // compute the remainder

        return arr[a][rem];
    }

    //! overloaded operator to return 128 bit vector
    Is32vec4& operator[](size_t i)
    {
#ifdef _DEBUG
        assert(i<size128);
#endif // _DEBUG

        return arr[i];
    }

    void setUnusedValue(int num)
    {
        if(size128==0||remainder==0)
            return;

        size_t unused = 4 - remainder;

        if(unused>=1)
            arr[size128-1][3] = num;
        if(unused>=2)
            arr[size128-1][2] = num;
        if(unused>=3)
            arr[size128-1][1] = num;
    }

    size_t GetSize128() { return size128; }
    size_t GetSize32() { return size32; }
    size_t GetRemainder() { return remainder; }

private:
    //! point to array of 128 bit elements
    Is32vec4* arr;
    //! size of array in 128 bit chunks
    size_t size128;
    //! size of array in 32 bit chunks (might be (32 * x) < (128/4 * x)
    size_t size32;
    //! Remainder of 32 bit element at the last 128 bit element.
    size_t remainder;
};

The overloaded [] operator returns a 128 bit vector while the () operator returns a 32 bit integer. I do not show the code for float and double array as they are similar in nature to the integer array. I have made a C++/CLI wrapper class for each array class, so that C# and Visual Basic application can access these array class but it is advisable to batch the SIMD operations in C++ because the managed to native transition is pretty expensive, we want to reduce the native calls in our .NET application as few as possible. The array wrapper class merely provides a way for C# or other .NET language to initialize the SIMD array before C++ processing and to retrieve the results after C++ processing. The below example shows how to call the overloaded operators in C++.

C++
using namespace NativeSIMDArray;

Int128Array intArr(5, 20);
intArr.setUnusedValue(6);

Is32vec4 div(3,3,3,3);

for(size_t i=0; i<intArr.GetSize128(); ++i)
{
    intArr[i] = intArr[i] % div;
}
for(size_t i=0; i<intArr.GetSize128(); ++i)
{
    for(size_t j=0; j<4; ++j)
        cout << intArr[i][j] <<endl;
}
for(size_t k=0; k<intArr.GetSize32(); ++k)
{
    intArr(k) = 1;
    cout << intArr(k) <<endl;
}

Drawing A Circle

We will use the F32vec4 class (128bit float vector) to draw a circle. I'll show a plain scalar C++ code to do it first.

Draw a green circle
Pixel within the circle

The code uses Pythagorean theorem to determine if the pixel falls within the circle. Using X and Y coordinates from the center of the circle, we calculate the hypotenuse, using sqrt(X2+Y2). The hypotenuse is the pixel distance from the center point. If it is smaller than the radius, then it is within the circle and the pixel should be set to green color.

C++
// Draw Circle without optimization
void CScratchPadDlg::OnPaint()
{
    // ... non-relevant GDI+ code is not shown here

    if( !pixelsCanvas )
        return;

    UINT col = 0;
    int stride = bitmapDataCanvas.Stride >> 2;

    float fPointX = 100.0f; // x coordinates of the circle center
    float fPointY = 100.0f; // y coordinates of the circle center
    float fRadius = 40.0f; // radius of the circle center

    float fy=0.0f;
    float fx=0.0f;

    UINT color = 0xff00ff00;
    for(UINT row = 0, fy=0.0f; row < bitmapDataCanvas.Height; row+=1, fy+=1.0f)
    {
        for(col = 0, fx=0.0f; col < bitmapDataCanvas.Width; col+=1, fx+=1.0f)
        {
            // calculate the index of destination pixel array
            int index = row * stride + col;

            // Subtract center X from the pixel X coordinates
            float X = fx - fPointX;
            // Subtract center Y from the pixel Y coordinates
            float Y = fy - fPointY;

            // compute the square of X, that is X * X = X to power of 2
            X = X * X;
            // compute the square of Y, that is Y * Y = Y to power of 2
            Y = Y * Y;

            // Add up the X square and Y square
            float hypSq = X + Y;

            // Find the hypotenuse by computing square root
            float hyp = std::sqrt(hypSq);

            UINT origPixel = pixelsCanvas[index];
            if(hyp<=fRadius)
            {
                pixelsCanvas[index] = color;
            }
            else
            {
                pixelsCanvas[index] = origPixel;
            }
        }
    }

    graphics.DrawImage(m_pSrcBitmap,0,0,m_pSrcBitmap->GetWidth(),m_pSrcBitmap->GetHeight());
}

I have to admit this is an inefficient way to draw circle. A better and faster way is to draw a circle with a stroke and then use floodfill to fill the circle with a solid color. And there is also no antialiasing at the circle boundary. Our demo at the end requires circles to be drawn in such a way.

Below is the version written with intrinsic (not using F32vec4 yet). I will not try to explain this intrinsic version but I'll explain the F32vec4 version. Both version are similar.

C++
// Draw Circle with SIMD intrinsic
void CScratchPadDlg::OnPaint()
{
    // ... non-relevant GDI+ code is not shown here

    UINT col = 0;
    int stride = bitmapDataCanvas.Stride >> 2;

    float fPointX = 100.0f; // x coordinates of the circle center
    float fPointY = 100.0f; // y coordinates of the circle center
    float fRadius = 40.0f; // radius of the circle center

    // vector of 4 x coordinates of the circle center
    __m128 vecPointX = _mm_set_ps1(fPointX);
    // vector of 4 y coordinates of the circle center
    __m128 vecPointY = _mm_set_ps1(fPointY);
    // vector of 4 radius of the circle center
    __m128 vecRadius = _mm_set_ps1(fRadius);

    float fy=0.0f;
    float fx=0.0f;

    UINT color = 0xff00ff00;
    for(UINT row = 0, fy=0.0f; row < bitmapDataCanvas.Height; row+=1, fy+=1.0f)
    {
        for(col = 0, fx=0.0f; col < bitmapDataCanvas.Width; col+=4, fx+=4.0f)
        {
            // calculate the index of destination pixel array
            int index = row * stride + col;

            // vector of X coordinates of the 4 pixels, it is inverse of of little endian
            __m128 vecX= _mm_set_ps(fx+3.0f, fx+2.0f, fx+1.0f, fx+0.0f);
            // vector of Y coordinates of the 4 pixels
            __m128 vecY = _mm_set_ps1(fy);

            // Subtract center X from the pixel X coordinates
            vecX = _mm_sub_ps(vecX, vecPointX);

            // Subtract center Y from the pixel Y coordinates
            vecY = _mm_sub_ps(vecY, vecPointY);

            // compute the square of X, that is X * X = X to power of 2
            vecX = _mm_mul_ps(vecX, vecX);
            // compute the square of Y, that is Y * Y = Y to power of 2
            vecY = _mm_mul_ps(vecY, vecY);

            // Add up the X square and Y square
            __m128 vecHypSq = _mm_add_ps(vecX, vecY);

            // Find the hypotenuse by computing square root
            __m128 vecHyp = _mm_sqrt_ps(vecHypSq);

            // Generate the mask for condition of hypotenuse < radius
            __m128 mask = _mm_cmple_ps(vecHyp, vecRadius);

            // all 4 pixel in mask vector, falls within the width
            if(col+3<bitmapDataCanvas.Width)
            {
                UINT origPixel = pixelsCanvas[index+0];
                pixelsCanvas[index+0] = color & (__m128(mask)).m128_u32[0];
                pixelsCanvas[index+0] |= origPixel & ~((__m128(mask)).m128_u32[0]);

                origPixel = pixelsCanvas[index+1];
                pixelsCanvas[index+1] = color & (__m128(mask)).m128_u32[1];
                pixelsCanvas[index+1] |= origPixel & ~((__m128(mask)).m128_u32[1]);

                origPixel = pixelsCanvas[index+2];
                pixelsCanvas[index+2] = color & (__m128(mask)).m128_u32[2];
                pixelsCanvas[index+2] |= origPixel & ~((__m128(mask)).m128_u32[2]);

                origPixel = pixelsCanvas[index+3];
                pixelsCanvas[index+3] = color & (__m128(mask)).m128_u32[3];
                pixelsCanvas[index+3] |= origPixel & ~((__m128(mask)).m128_u32[3]);
            }
            else // all 4 pixel in mask vector do not falls within the width: have to test 1 by 1.
            {
                UINT origPixel = pixelsCanvas[index+0];
                pixelsCanvas[index+0] = color & (__m128(mask)).m128_u32[0];
                pixelsCanvas[index+0] |= origPixel & ~((__m128(mask)).m128_u32[0]);

                if(col+1<bitmapDataCanvas.Width)
                {
                    origPixel = pixelsCanvas[index+1];

                    pixelsCanvas[index+1] = color & (__m128(mask)).m128_u32[1];
                    pixelsCanvas[index+1] |= origPixel & ~((__m128(mask)).m128_u32[1]);
                }

                if(col+2<bitmapDataCanvas.Width)
                {
                    origPixel = pixelsCanvas[index+2];

                    pixelsCanvas[index+2] = color & (__m128(mask)).m128_u32[2];
                    pixelsCanvas[index+2] |= origPixel & ~((__m128(mask)).m128_u32[2]);
                }

                if(col+3<bitmapDataCanvas.Width)
                {
                    origPixel = pixelsCanvas[index+3];

                    pixelsCanvas[index+3] = color & (__m128(mask)).m128_u32[3];
                    pixelsCanvas[index+3] |= origPixel & ~((__m128(mask)).m128_u32[3]);
                }
            }
        }
    }

    graphics.DrawImage(m_pSrcBitmap,0,0,m_pSrcBitmap->GetWidth(),m_pSrcBitmap->GetHeight());
}

Below is the code written with F32vec4 class provided by Microsoft Visual Studio. In fact, I wrote the vector class version first, then wrote the scalar and SIMD intrinsic version. It's easier to convert the vector class version to the intrinsic version: I only have to look up the intrinsic name inside the vector class overloaded operators and replace them accordingly.

C++
// Draw Circle with SIMD class
void CScratchPadDlg::OnPaint()
{
    // ... non-relevant GDI+ code is not shown here

    UINT col = 0;
    int stride = bitmapDataCanvas.Stride >> 2;

    float fPointX = 100.0f; // x coordinates of the circle center
    float fPointY = 100.0f; // y coordinates of the circle center
    float fRadius = 40.0f; // radius of the circle center

    // vector of 4 x coordinates of the circle center
    F32vec4 vecPointX(fPointX);
    // vector of 4 y coordinates of the circle center
    F32vec4 vecPointY(fPointY);
    // vector of 4 radius of the circle center
    F32vec4 vecRadius(fRadius);

    float fy=0.0f;
    float fx=0.0f;

    UINT color = 0xff00ff00;
    for(UINT row = 0, fy=0.0f; row < bitmapDataCanvas.Height; row+=1, fy+=1.0f)
    {
        for(col = 0, fx=0.0f; col < bitmapDataCanvas.Width; col+=4, fx+=4.0f)
        {
            // calculate the index of destination pixel array
            int index = row * stride + col;

            // vector of X coordinates of the 4 pixels, it is inverse of of little endian
            F32vec4 vecX(fx+3.0f, fx+2.0f, fx+1.0f, fx+0.0f);
            // vector of Y coordinates of the 4 pixels
            F32vec4 vecY((float)(fy));

            // Subtract center X from the pixel X coordinates
            vecX -= vecPointX;
            // Subtract center Y from the pixel Y coordinates
            vecY -= vecPointY;

            // compute the square of X, that is X * X = X to power of 2
            vecX = vecX * vecX;
            // compute the square of Y, that is Y * Y = Y to power of 2
            vecY = vecY * vecY;

            // Add up the X square and Y square
            F32vec4 vecHypSq = vecX + vecY;

            // Find the hypotenuse by computing square root
            F32vec4 vecHyp = sqrt(vecHypSq);

            // Generate the mask for condition of hypotenuse < radius
            F32vec4 mask = cmple(vecHyp, vecRadius);

            // all 4 pixel in mask vector, falls within the width
            if(col+3<bitmapDataCanvas.Width)
            {
                UINT origPixel = pixelsCanvas[index+0];
                pixelsCanvas[index+0] = color & (__m128(mask)).m128_u32[0];
                pixelsCanvas[index+0] |= origPixel & ~((__m128(mask)).m128_u32[0]);

                origPixel = pixelsCanvas[index+1];
                pixelsCanvas[index+1] = color & (__m128(mask)).m128_u32[1];
                pixelsCanvas[index+1] |= origPixel & ~((__m128(mask)).m128_u32[1]);

                origPixel = pixelsCanvas[index+2];
                pixelsCanvas[index+2] = color & (__m128(mask)).m128_u32[2];
                pixelsCanvas[index+2] |= origPixel & ~((__m128(mask)).m128_u32[2]);

                origPixel = pixelsCanvas[index+3];
                pixelsCanvas[index+3] = color & (__m128(mask)).m128_u32[3];
                pixelsCanvas[index+3] |= origPixel & ~((__m128(mask)).m128_u32[3]);
            }
            else // all 4 pixel in mask vector do not falls within the width: have to test 1 by 1.
            {
                UINT origPixel = pixelsCanvas[index+0];
                pixelsCanvas[index+0] = color & (__m128(mask)).m128_u32[0];
                pixelsCanvas[index+0] |= origPixel & ~((__m128(mask)).m128_u32[0]);

                if(col+1<bitmapDataCanvas.Width)
                {
                    origPixel = pixelsCanvas[index+1];

                    pixelsCanvas[index+1] = color & (__m128(mask)).m128_u32[1];
                    pixelsCanvas[index+1] |= origPixel & ~((__m128(mask)).m128_u32[1]);
                }

                if(col+2<bitmapDataCanvas.Width)
                {
                    origPixel = pixelsCanvas[index+2];

                    pixelsCanvas[index+2] = color & (__m128(mask)).m128_u32[2];
                    pixelsCanvas[index+2] |= origPixel & ~((__m128(mask)).m128_u32[2]);
                }

                if(col+3<bitmapDataCanvas.Width)
                {
                    origPixel = pixelsCanvas[index+3];

                    pixelsCanvas[index+3] = color & (__m128(mask)).m128_u32[3];
                    pixelsCanvas[index+3] |= origPixel & ~((__m128(mask)).m128_u32[3]);
                }
            }
        }
    }

    graphics.DrawImage(m_pSrcBitmap,0,0,m_pSrcBitmap->GetWidth(),m_pSrcBitmap->GetHeight());
}

The benchmark result is as below.

Intel i7 CPU (Quad Core)
Debug Mode

Scalar : 12ms
SSE Intrinsic : 5ms
SSE vector class : 38ms.

Release Mode

Scalar : 2ms
SSE Intrinsic : 1ms
SSE vector class : 1ms.
Core 2 Duo CPU (Dual Core)
Debug Mode

Scalar : 18ms
SSE Intrinsic : 5ms
SSE vector class : 57ms.

Release Mode

Scalar : 4ms
SSE Intrinsic : 1ms
SSE vector class : 1ms.

As the reader can see, the benchmark result for the SSE vector in debug build is very bad and is not acceptable for our demo application which animates the expanding circle on a interval of 30 milliseconds. The reason is probably because in debug build, the inline constructor and methods are not inline'd by the compiler. I work around it by always running the program in release mode. Afterall, it is a simple graphics program which does not need debugging, I figure out what is wrong by observing.

Now I begin to explain the previous code snippet.

C++
float fPointX = 100.0f; // x coordinates of the circle center
float fPointY = 100.0f; // y coordinates of the circle center
float fRadius = 40.0f; // radius of the circle center

// vector of 4 x coordinates of the circle center
F32vec4 vecPointX(fPointX);
// vector of 4 y coordinates of the circle center
F32vec4 vecPointY(fPointY);
// vector of 4 radius of the circle center
F32vec4 vecRadius(fRadius);

This code merely initialize the 2 vector of X coordinates and Y coordinates of the circle center. The 3rd vector is circle radius.

C++
float fy=0.0f;
float fx=0.0f;

UINT color = 0xff00ff00;
for(UINT row = 0, fy=0.0f; row < bitmapDataCanvas.Height; row+=1, fy+=1.0f)
{
    for(col = 0, fx=0.0f; col < bitmapDataCanvas.Width; col+=4, fx+=4.0f)
    {
        // calculate the index of destination pixel array
        int index = row * stride + col;

        // vector of X coordinates of the 4 pixels, it is inverse of of little endian
        F32vec4 vecX(fx+3.0f, fx+2.0f, fx+1.0f, fx+0.0f);
        // vector of Y coordinates of the 4 pixels
        F32vec4 vecY((float)(fy));

This is a nested for loop. The outer loop iterates the rows while the inner loop iterates every 4 columns. vecX and vecY are the columns and rows of the pixels. You may have noticed that the float arguments to vecX constructor is reversed. This is due to little endianness of Intel processor. Below is what it looks like if they are not reversed.

Circle with columns of 4 pixels inverted
C++
// Subtract center X from the pixel X coordinates
vecX -= vecPointX;
// Subtract center Y from the pixel Y coordinates
vecY -= vecPointY;

// compute the square of X, that is X * X = X to power of 2
vecX = vecX * vecX;
// compute the square of Y, that is Y * Y = Y to power of 2
vecY = vecY * vecY;

// Add up the X square and Y square
F32vec4 vecHypSq = vecX + vecY;

// Find the hypotenuse by computing square root
F32vec4 vecHyp = sqrt(vecHypSq);

Now we subtract the circle center from the XY coordinates of the pixels to get where the pixels are in relative to te circle center. We may get a negative X or Y result but we do not have to convert them to the absolute value because the next multiplication operation to compute for the square value (eg, X2, Y2), will convert negative values to positive values because multiplication of 2 negative numbers gives a positive result. The rest of the code computes the hypotenuse.

C++
// Generate the mask for condition of hypotenuse < radius
F32vec4 mask = cmple(vecHyp, vecRadius);

The first statement generates a mask if the hypotenuse of the triangle is less than the circle radius. The mask is all ones (0xffffffff) when the condition is true, else is all zeroes

C++
// all 4 pixel in mask vector, falls within the width
if(col+3<bitmapDataCanvas.Width)
{
    UINT origPixel = pixelsCanvas[index+0];
    pixelsCanvas[index+0] = color & (__m128(mask)).m128_u32[0];
    pixelsCanvas[index+0] |= origPixel & ~((__m128(mask)).m128_u32[0]);

    origPixel = pixelsCanvas[index+1];
    pixelsCanvas[index+1] = color & (__m128(mask)).m128_u32[1];
    pixelsCanvas[index+1] |= origPixel & ~((__m128(mask)).m128_u32[1]);

    origPixel = pixelsCanvas[index+2];
    pixelsCanvas[index+2] = color & (__m128(mask)).m128_u32[2];
    pixelsCanvas[index+2] |= origPixel & ~((__m128(mask)).m128_u32[2]);

    origPixel = pixelsCanvas[index+3];
    pixelsCanvas[index+3] = color & (__m128(mask)).m128_u32[3];
    pixelsCanvas[index+3] |= origPixel & ~((__m128(mask)).m128_u32[3]);
}
else // all 4 pixel in mask vector do not falls within the width: have to test 1 by 1.
{
    UINT origPixel = pixelsCanvas[index+0];
    pixelsCanvas[index+0] = color & (__m128(mask)).m128_u32[0];
    pixelsCanvas[index+0] |= origPixel & ~((__m128(mask)).m128_u32[0]);

    if(col+1<bitmapDataCanvas.Width)
    {
        origPixel = pixelsCanvas[index+1];

        pixelsCanvas[index+1] = color & (__m128(mask)).m128_u32[1];
        pixelsCanvas[index+1] |= origPixel & ~((__m128(mask)).m128_u32[1]);
    }

    if(col+2<bitmapDataCanvas.Width)
    {
        origPixel = pixelsCanvas[index+2];

        pixelsCanvas[index+2] = color & (__m128(mask)).m128_u32[2];
        pixelsCanvas[index+2] |= origPixel & ~((__m128(mask)).m128_u32[2]);
    }

    if(col+3<bitmapDataCanvas.Width)
    {
        origPixel = pixelsCanvas[index+3];

        pixelsCanvas[index+3] = color & (__m128(mask)).m128_u32[3];
        pixelsCanvas[index+3] |= origPixel & ~((__m128(mask)).m128_u32[3]);
    }
}

The first if condition test whether column + 3 is less than the image width. If it is, it will assign the color to the pixelCanvas[i] according if the mask is 0xffffffff. If mask is all zeros, the original color will be applied to pixelCanvas[i] instead. The else block executes when column + 3 is more than the image width and we have to test every pixel whether it is within the boundary before assigning the color to pixelCanvas[i].

Out of curiosity, I modified the earlier code snippet to draw a circle with a gradient. You noticed the gradients color 'spiked' into the center. This is okay because our final demo does not involve gradients.

Draw a circle with gradient
C++
// Draw Circle with gradient
void CScratchPadDlg::OnPaint()
{
    // ... non-relevant GDI+ code is not shown here

    UINT col = 0;
    int stride = bitmapDataCanvas.Stride >> 2;

    float fPointX = 50.0f; // x coordinates of the circle center
    float fPointY = 50.0f; // Y coordinates of the circle center
    float fRadius = 40.0f; // radius of the circle center

    // vector of 4 x coordinates of the circle center
    F32vec4 vecPointX(fPointX, fPointX, fPointX, fPointX);
    // vector of 4 x coordinates of the circle center
    F32vec4 vecPointY(fPointY, fPointY, fPointY, fPointY);
    // vector of 4 radius of the circle center
    F32vec4 vecRadius(fRadius, fRadius, fRadius, fRadius);

    float fy=0.0f;
    float fx=0.0f;

    UINT color = 0xff0000ff;
    UINT color2 = 0xff00ffff;

    UINT* arrGrad = GenGrad(color, color2, 40);

    for(UINT row = 0, fy=0.0f; row < bitmapDataCanvas.Height; row+=1, fy+=1.0f)
    {
        for(col = 0, fx=0.0f; col < bitmapDataCanvas.Width; col+=4, fx+=4.0f)
        {
            // calculate the index of destination pixel array
            int index = row * stride + col;

            // vector of X coordinates of the 4 pixels, it is inverse of of little endian
            F32vec4 vecX(fx+3.0f, fx+2.0f, fx+1.0f, fx+0.0f);
            // vector of Y coordinates of the 4 pixels
            F32vec4 vecY(fy, fy, fy, fy);

            // Subtract center X from the pixel X coordinates
            vecX -= vecPointX;
            // Subtract center Y from the pixel Y coordinates
            vecY -= vecPointY;

            // compute the square of X, that is X * X = X to power of 2
            vecX = vecX * vecX;
            // compute the square of Y, that is Y * Y = Y to power of 2
            vecY = vecY * vecY;

            // Add up the X square and Y square
            F32vec4 vecHypSq = vecX + vecY;

            // Find the hypotenuse by computing square root
            F32vec4 vecHyp = sqrt(vecHypSq);

            // Generate the mask for condition of hypotenuse < radius
            F32vec4 mask = cmple(vecHyp, vecRadius);

            // all 4 pixel in mask vector, falls within the width
            if(col+3<bitmapDataCanvas.Width)
            {
                UINT origPixel = pixelsCanvas[index+0];

                pixelsCanvas[index+0] = arrGrad[ (int)((__m128(vecHyp)).m128_f32[0]) ] & (__m128(mask)).m128_u32[0];
                pixelsCanvas[index+0] |= origPixel & ~((__m128(mask)).m128_u32[0]);

                origPixel = pixelsCanvas[index+1];
                pixelsCanvas[index+1] = arrGrad[ (int)((__m128(vecHyp)).m128_f32[1]) ] & (__m128(mask)).m128_u32[1];
                pixelsCanvas[index+1] |= origPixel & ~((__m128(mask)).m128_u32[1]);

                origPixel = pixelsCanvas[index+2];
                pixelsCanvas[index+2] = arrGrad[ (int)((__m128(vecHyp)).m128_f32[2]) ] & (__m128(mask)).m128_u32[2];
                pixelsCanvas[index+2] |= origPixel & ~((__m128(mask)).m128_u32[2]);

                origPixel = pixelsCanvas[index+3];
                pixelsCanvas[index+3] = arrGrad[ (int)((__m128(vecHyp)).m128_f32[3]) ] & (__m128(mask)).m128_u32[3];
                pixelsCanvas[index+3] |= origPixel & ~((__m128(mask)).m128_u32[3]);
            }
            else // all 4 pixel in mask vector do not falls within the width: have to test 1 by 1.
            {
                UINT origPixel = pixelsCanvas[index+0];

                pixelsCanvas[index+0] = arrGrad[ (int)((__m128(vecHyp)).m128_f32[0]) ] & (__m128(mask)).m128_u32[0];
                pixelsCanvas[index+0] |= origPixel & ~((__m128(mask)).m128_u32[0]);

                if(col+1<bitmapDataCanvas.Width)
                {
                    origPixel = pixelsCanvas[index+1];

                    pixelsCanvas[index+1] = arrGrad[ (int)((__m128(vecHyp)).m128_f32[1]) ] & (__m128(mask)).m128_u32[1];
                    pixelsCanvas[index+1] |= origPixel & ~((__m128(mask)).m128_u32[1]);
                }

                if(col+2<bitmapDataCanvas.Width)
                {
                    origPixel = pixelsCanvas[index+2];

                    pixelsCanvas[index+2] = arrGrad[ (int)((__m128(vecHyp)).m128_f32[2]) ] & (__m128(mask)).m128_u32[2];
                    pixelsCanvas[index+2] |= origPixel & ~((__m128(mask)).m128_u32[2]);
                }

                if(col+3<bitmapDataCanvas.Width)
                {
                    origPixel = pixelsCanvas[index+3];

                    pixelsCanvas[index+3] = arrGrad[ (int)((__m128(vecHyp)).m128_f32[3]) ] & (__m128(mask)).m128_u32[3];
                    pixelsCanvas[index+3] |= origPixel & ~((__m128(mask)).m128_u32[3]);
                }
            }
        }
    }

    delete [] arrGrad;

    graphics.DrawImage(m_pSrcBitmap,0,0,m_pSrcBitmap->GetWidth(),m_pSrcBitmap->GetHeight());
}

UINT* CDrawCircleDlg::GenGrad(UINT a, UINT b, UINT len)
{
    int r1=(a&0xff0000)>>16,g1=(a&0xff00)>>8,b1=(a&0xff); //Any start color
    int r2=(b&0xff0000)>>16,g2=(b&0xff00)>>8,b2=(b&0xff); //Any start color

    UINT* arr = new UINT[len+1];

    for(UINT i=0;i<len+1;i++)
    { 
        int r,g,b;
        r = r1 + (i * (r2-r1) / len);
        g = g1 + (i * (g2-g1) / len);
        b = b1 + (i * (b2-b1) / len);
        
        arr[i] = 0xff000000 | r << 16 | g << 8 | b;
    }

    return arr;
}

Circle Cover Effect

Circle Cover Effect

In this section, we are going to use the code which we developed previously for drawing circle to implement a circle cover effect where a circle keep expanding and exposing the image underneath. So how does the code know when to stop expanding (that is when the entire image is already covered/replaced)? Look at the image below. When you mouse-click on the client area, the distance from the 4 corner are calculated and the longest distance (red line) is selected as the maximum radius to expand.

Distance from 4 corners

Tip: you can replace the flickr1.jpg and flickr2.jpg in the res folder with your own images (just rename them to be flickr1.jpg and flickr2.jpg but their dimension have to be equal), the CircleCover application will automatically resize its client dimension to be the same as the 1st embedded image on startup.

If the

Hexagon Cover Effect

Hexagon Cover Effect

The hexagon cover effect is similar to the circle cover application. The only difference is the expanding edge is replaced by little hexagons. As with the CircleCover application, you can replace the images before compilation.

Hexagon center within circle radius?

This is how it works. The difference between each hexagon center and the mouse point is calculated. If the difference is greater than the hexagon, then the hexagon should be displayed. If the difference is smaller, then a precomputed mask (which store the distance of each pixel from the hexagon center) is used to determine whether the pixel underneath should be displayed. I did not convert this application to use SSE, simply because the speed gain will not be much as the number of distance to compute between each hexagon center and the mouse point, are not many. And the distance of each pixel from the hexagon center is precomputed as a mask. I leave it as a exercise for the reader to convert this code to use SSE. The source code is heavily commented.

Conclusion

We has seen how to use the SIMD vector class provided by Visual Studio for code clarity and code speedup. However, we have also seen the speed penalty in debug build is quite bad (40x). Only simple operations like addition, subtraction, multiplication, division and square root are provided by SIMD vector class. There is no trigonometry functions (like sine and cosine). This is why I restricted my effect to use Pythagorean theorem. The source code is hosted at Codeplex.

History

  • 5th March 2012 : Added how to write integer division operation using scalar division.
  • 28th February 2012 : Updated the article with table of the vector classes and their header files.
  • 19th February 2012 : First release

Reference

License

This article, along with any associated source code and files, is licensed under The Microsoft Public License (Ms-PL)


Written By
Software Developer (Senior)
Singapore Singapore
Shao Voon is from Singapore. His interest lies primarily in computer graphics, software optimization, concurrency, security, and Agile methodologies.

In recent years, he shifted focus to software safety research. His hobby is writing a free C++ DirectX photo slideshow application which can be viewed here.

Comments and Discussions

Discussions on this specific version of this article. Add your comments on how to improve this article here. These comments will not be visible on the final published version of this article.