
Introduction to SSE Programming

By , 10 Jul 2003
 

Introduction

The Intel Streaming SIMD Extensions (SSE) technology enhances the performance of floating-point operations. Visual Studio .NET 2003 supports a set of SSE Intrinsics which allow the use of SSE instructions directly from C++ code, without writing Assembly instructions. The MSDN SSE topics [2] may be confusing for programmers who are not familiar with SSE Assembly programming. However, reading the Intel Software manuals [1] together with MSDN makes it possible to understand the basics of SSE programming.

SSE follows the single-instruction, multiple-data (SIMD) execution model. Consider the following programming task: computing the square root of each element in a long floating-point array. The algorithm for this task may be written this way:

for each  f in array
    f = sqrt(f)
Let's be more specific:
for each  f in array
{
    load f to the floating-point register
    calculate the square root
    write the result from the register to memory
}
Processors with Intel SSE support have eight 128-bit registers, each of which may contain 4 single-precision floating-point numbers. SSE is a set of instructions which load floating-point numbers into the 128-bit registers, perform arithmetic and logical operations on them, and write the result back to memory. Using SSE technology, the algorithm may be written as:
for each  4 members in array
{
    load 4 members to the SSE register
    calculate 4 square roots in one operation
    write the result from the register to memory
}
The C++ programmer writing a program using SSE Intrinsics doesn't need to care about registers. He has a 128-bit __m128 type and a set of functions to perform arithmetic and logical operations. It's up to the C++ compiler to decide which SSE register to use and to make code optimizations. SSE technology may be used when the same operation is applied to each element of a long floating-point array.
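As a rough illustration of the four-at-a-time loop described above, here is a minimal sketch. The function name SqrtArraySSE is hypothetical and not part of the demo projects. It uses the unaligned load and store intrinsics (_mm_loadu_ps, _mm_storeu_ps) so that it works with any float array; that is an assumption made for simplicity, and the alignment requirements of the aligned variants are covered in the Data Alignment section.

```cpp
#include <xmmintrin.h>  // SSE intrinsics and the __m128 type
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the article's demo projects):
// replaces every element of data[0..count) with its square root,
// processing four floats per iteration. count must be a multiple of 4.
void SqrtArraySSE(float* data, std::size_t count)
{
    assert(count % 4 == 0);

    for (std::size_t i = 0; i < count; i += 4)
    {
        __m128 v = _mm_loadu_ps(data + i);   // load 4 floats (no alignment required)
        v = _mm_sqrt_ps(v);                  // calculate 4 square roots in one operation
        _mm_storeu_ps(data + i, v);          // write the 4 results back to memory
    }
}
```

Calling SqrtArraySSE(a, 8) on an 8-element array executes the loop body only twice.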

SSE Programming Details

Include Files

All SSE intrinsics and the __m128 data type are declared in the xmmintrin.h file:
#include <xmmintrin.h>
Since SSE intrinsics are built into the compiler and are not library functions, there are no lib files to link.

Data Alignment

Each float array processed by SSE instructions should have 16-byte alignment. A static array is declared using the __declspec(align(16)) keyword:
__declspec(align(16)) float m_fArray[ARRAY_SIZE];
A dynamic array should be allocated using the _aligned_malloc function:
m_fArray = (float*) _aligned_malloc(ARRAY_SIZE * sizeof(float), 16);
An array allocated by the _aligned_malloc function is released using the _aligned_free function:
_aligned_free(m_fArray);
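The sketch below shows one way to allocate an aligned array and to verify the alignment at run time. It swaps in _mm_malloc and _mm_free, which xmmintrin.h makes available on the compilers I am aware of, in place of _aligned_malloc and _aligned_free; treat that substitution as an assumption. The helper names IsAligned16 and AllocFloats16 are hypothetical.

```cpp
#include <xmmintrin.h>  // also makes _mm_malloc / _mm_free available
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when p sits on a 16-byte boundary.
bool IsAligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: allocates count floats on a 16-byte boundary.
// Returns a null pointer on failure; release the block with _mm_free.
float* AllocFloats16(std::size_t count)
{
    float* p = static_cast<float*>(_mm_malloc(count * sizeof(float), 16));
    assert(p == nullptr || IsAligned16(p));
    return p;
}
```

A check like IsAligned16 is cheap enough to keep in debug builds in front of any code that uses aligned loads.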

__m128 Data Type

Variables of this type are used as SSE instruction operands. Their contents should not be accessed directly. Variables of type __m128 are automatically aligned on 16-byte boundaries.
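Instead of reaching into an __m128 variable, its four components can be copied to an ordinary float array with a store intrinsic. A minimal sketch follows; the helper name StoreLanes is hypothetical.

```cpp
#include <xmmintrin.h>
#include <cassert>

// Hypothetical helper: copies the four components of v into out[0..3].
// out does not have to be 16-byte aligned because the unaligned store is used.
void StoreLanes(__m128 v, float* out)
{
    assert(out != nullptr);
    _mm_storeu_ps(out, v);   // out[0] receives the lowest component
}
```

Note that _mm_set_ps takes its arguments from the highest component to the lowest, so StoreLanes(_mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f), out) leaves 1, 2, 3, 4 in out[0] to out[3].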

Detection of SSE Support

SSE instructions may be used only if they are supported by the processor. The Visual C++ CPUID sample [4] shows how to detect support for SSE, MMX and other processor features. Detection is done using the CPUID Assembly instruction. See details in this sample and in the Intel Software manuals [1].
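This is not the code of the CPUID sample [4], only a minimal sketch of the same idea. It assumes the compiler provides either the __cpuid intrinsic from intrin.h (Visual C++) or the __get_cpuid helper from cpuid.h (GCC and Clang). CPUID function 1 reports SSE support in bit 25 of the EDX register. The function name HasSSE is hypothetical.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>     // __cpuid
#else
#include <cpuid.h>      // __get_cpuid (GCC / Clang)
#endif

// Hypothetical helper: true when the processor reports SSE support.
bool HasSSE()
{
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};           // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);                     // CPUID function 1: feature flags
    return (regs[3] & (1 << 25)) != 0;    // EDX bit 25 = SSE
#else
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;                     // CPUID function 1 not available
    return (edx & (1u << 25)) != 0;       // EDX bit 25 = SSE
#endif
}
```

A program can call HasSSE() once at startup and fall back to the plain C++ code path when it returns false.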

SSETest Demo Project

The SSETest project is a dialog-based application which performs the following calculation with three float arrays:
fResult[i] = sqrt( fSource1[i]*fSource1[i] + fSource2[i]*fSource2[i] ) + 0.5

i = 0, 1, 2 ... ARRAY_SIZE-1
ARRAY_SIZE is defined as 30000. The source arrays are filled using the sin and cos functions. The Waterfall chart control written by Kris Jearakul [3] is used to show the source arrays and the result of the calculations. The calculation time (ms) is shown in the dialog. The calculation may be done in one of three ways:
  • C++ code;
  • C++ code with SSE Intrinsics;
  • Inline Assembly with SSE instructions.
C++ function:
void CSSETestDlg::ComputeArrayCPlusPlus(
          float* pArray1,                   // [in] first source array
          float* pArray2,                   // [in] second source array
          float* pResult,                   // [out] result array
          int nSize)                        // [in] size of all arrays
{

    int i;

    float* pSource1 = pArray1;
    float* pSource2 = pArray2;
    float* pDest = pResult;

    for ( i = 0; i < nSize; i++ )
    {
        *pDest = (float)sqrt((*pSource1) * (*pSource1) + (*pSource2)
                 * (*pSource2)) + 0.5f;

        pSource1++;
        pSource2++;
        pDest++;
    }
}
Now let's rewrite this function using the SSE Intrinsics. To find the required SSE Intrinsics, I use the following approach:
  • Find the Assembly SSE instruction in the Intel Software manuals [1]. First I look for this instruction in Volume 1, Chapter 9, and then find the detailed description in Volume 2. This description also contains the name of the corresponding C++ Intrinsic.
  • Search for the SSE Intrinsic name in the MSDN Library.
Some SSE Intrinsics are composite and cannot be found this way. They should be looked up directly in the MSDN Library (the descriptions are very short but readable). The results of such a search are shown in the following table:

Required Function | Assembly Instruction | SSE Intrinsic
Assign float value to 4 components of 128-bit value | movss + shufps | _mm_set_ps1 (composite)
Multiply 4 float components of 2 128-bit values | mulps | _mm_mul_ps
Add 4 float components of 2 128-bit values | addps | _mm_add_ps
Compute the square root of 4 float components in 128-bit values | sqrtps | _mm_sqrt_ps

C++ function with SSE Intrinsics:

void CSSETestDlg::ComputeArrayCPlusPlusSSE(
          float* pArray1,                   // [in] first source array
          float* pArray2,                   // [in] second source array
          float* pResult,                   // [out] result array
          int nSize)                        // [in] size of all arrays
{
    int nLoop = nSize/ 4;

    __m128 m1, m2, m3, m4;

    __m128* pSrc1 = (__m128*) pArray1;
    __m128* pSrc2 = (__m128*) pArray2;
    __m128* pDest = (__m128*) pResult;


    __m128 m0_5 = _mm_set_ps1(0.5f);        // m0_5[0, 1, 2, 3] = 0.5

    for ( int i = 0; i < nLoop; i++ )
    {
        m1 = _mm_mul_ps(*pSrc1, *pSrc1);        // m1 = *pSrc1 * *pSrc1
        m2 = _mm_mul_ps(*pSrc2, *pSrc2);        // m2 = *pSrc2 * *pSrc2
        m3 = _mm_add_ps(m1, m2);                // m3 = m1 + m2
        m4 = _mm_sqrt_ps(m3);                   // m4 = sqrt(m3)
        *pDest = _mm_add_ps(m4, m0_5);          // *pDest = m4 + 0.5
        
        pSrc1++;
        pSrc2++;
        pDest++;
    }
}
The function using inline Assembly is not shown here; anyone who is interested may read it in the demo project. Calculation times on my computer:
  • C++ code - 26 ms
  • C++ with SSE Intrinsics - 9 ms
  • Inline Assembly with SSE instructions - 9 ms
Execution time should be measured in the Release configuration, with compiler optimizations enabled.

SSESample Demo Project

The SSESample project is a dialog-based application which performs the following calculation with a float array:
fResult[i] = sqrt(fSource[i]*2.8)

i = 0, 1, 2 ... ARRAY_SIZE-1
The program also calculates the minimum and maximum values in the result array. ARRAY_SIZE is defined as 100000. The result array is shown in the listbox. The calculation time (ms) for each method is shown in the dialog:
  • C++ code - 6 ms on my computer;
  • C++ code with SSE Intrinsics - 3 ms;
  • Inline Assembly with SSE instructions - 2 ms.

Here the Assembly code performs better because of its intensive use of the SSE registers. Usually, however, C++ code with SSE Intrinsics performs as well as Assembly code or better, because it is difficult to write Assembly code which runs faster than the optimized code generated by the C++ compiler.

C++ function:

// Input: m_fInitialArray
// Output: m_fResultArray, m_fMin, m_fMax
void CSSESampleDlg::OnBnClickedButtonCplusplus()
{
    m_fMin = FLT_MAX;
    m_fMax = -FLT_MAX;   // lowest finite float (FLT_MIN is the smallest positive value)

    int i;

    for ( i = 0; i < ARRAY_SIZE; i++ )
    {
        m_fResultArray[i] = sqrt(m_fInitialArray[i]  * 2.8f);

        if ( m_fResultArray[i] < m_fMin )
            m_fMin = m_fResultArray[i];

        if ( m_fResultArray[i] > m_fMax )
            m_fMax = m_fResultArray[i];
    }
}
C++ function with SSE Intrinsics:
// Input: m_fInitialArray
// Output: m_fResultArray, m_fMin, m_fMax
void CSSESampleDlg::OnBnClickedButtonSseC()
{
    __m128 coeff = _mm_set_ps1(2.8f);      // coeff[0, 1, 2, 3] = 2.8
    __m128 tmp;

    __m128 min128 = _mm_set_ps1(FLT_MAX);  // min128[0, 1, 2, 3] = FLT_MAX
    __m128 max128 = _mm_set_ps1(-FLT_MAX); // max128[0, 1, 2, 3] = -FLT_MAX (lowest finite float)

    __m128* pSource = (__m128*) m_fInitialArray;
    __m128* pDest = (__m128*) m_fResultArray;

    for ( int i = 0; i < ARRAY_SIZE/4; i++ )
    {
        tmp = _mm_mul_ps(*pSource, coeff);      // tmp = *pSource * coeff
        *pDest = _mm_sqrt_ps(tmp);              // *pDest = sqrt(tmp)

        min128 =  _mm_min_ps(*pDest, min128);
        max128 =  _mm_max_ps(*pDest, max128);

        pSource++;
        pDest++;
    }

    // extract minimum and maximum values from min128 and max128
    union u
    {
        __m128 m;
        float f[4];
    } x;

    x.m = min128;
    m_fMin = min(x.f[0], min(x.f[1], min(x.f[2], x.f[3])));

    x.m = max128;
    m_fMax = max(x.f[0], max(x.f[1], max(x.f[2], x.f[3])));
}
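The union at the end of the function above is one way to reduce the four components to a single value. As an alternative sketch (not taken from the demo project), the reduction can stay in SSE registers using _mm_movehl_ps and _mm_shuffle_ps. The helper names HorizontalMin and HorizontalMax are hypothetical.

```cpp
#include <xmmintrin.h>

// Hypothetical helper: smallest of the four components of v.
float HorizontalMin(__m128 v)
{
    __m128 hi = _mm_movehl_ps(v, v);                           // hi = [v2, v3, v2, v3]
    __m128 m  = _mm_min_ps(v, hi);                             // m0 = min(v0, v2), m1 = min(v1, v3)
    __m128 s  = _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1)); // every component = m1
    m = _mm_min_ss(m, s);                                      // lowest component = min(m0, m1)
    return _mm_cvtss_f32(m);                                   // extract the lowest component
}

// Hypothetical helper: largest of the four components of v.
float HorizontalMax(__m128 v)
{
    __m128 hi = _mm_movehl_ps(v, v);                           // hi = [v2, v3, v2, v3]
    __m128 m  = _mm_max_ps(v, hi);                             // m0 = max(v0, v2), m1 = max(v1, v3)
    __m128 s  = _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1)); // every component = m1
    m = _mm_max_ss(m, s);                                      // lowest component = max(m0, m1)
    return _mm_cvtss_f32(m);                                   // extract the lowest component
}
```

With these helpers, the last part of the function could read m_fMin = HorizontalMin(min128) and m_fMax = HorizontalMax(max128). Since this code runs only once after the loop, the difference from the union version is mostly a matter of style.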

Sources

  1. Intel Software manuals.
  2. MSDN, Streaming SIMD Extensions (SSE). http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang/html/vcrefstreamingsimdextensions.asp
  3. Waterfall chart control written by Kris Jearakul. http://www.codeguru.com/controls/Waterfall.shtml
  4. Microsoft Visual C++ CPUID sample. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vcsample/html/vcsamcpuiddeterminecpucapabilities.asp
  5. Matt Pietrek. Under The Hood. February 1998 issue of Microsoft Systems Journal. http://www.microsoft.com/msj/0298/hood0298.aspx

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Alex Fr
Software Developer
Israel
No Biography provided


Comments and Discussions

 
Bug: Code modification (member Luc Morin, 31-Oct-11 7:21)
On a faster machine, the current code reports 0 ms. Here is a simple modification:
 
// In class CTimeCounter
float GetExecutionTime()
{
    LARGE_INTEGER nEndTime;
    float nCalcTime;

    QueryPerformanceCounter(&nEndTime);
    nCalcTime = (float)(nEndTime.QuadPart - m_nBeginTime.QuadPart) *
        ((float)1000)/(float)m_nFreq.QuadPart;

    delete this;

    return nCalcTime;
}
 

 
// Show execution time (ms); don't forget to update the declaration
void CSSESampleDlg::ShowTime(float nTime)
{
    if ( nTime == 0 )
        m_static_time.SetWindowText(_T(""));
    else
    {
        CString s;
        s.Format(_T("%4.2f"), nTime);
        m_static_time.SetWindowText(s);
    }
}
 
Luc
General: Re: Code modification (member Alex Fr, 31-Oct-11 8:00)
You are right, I use the same modification in my own projects.
General: simd (group metyouba, 22-Apr-10 22:04)
please
 
does any one have simd code for optimizing euclidean distance between 2 vectors or two 1d arrays
 
thanks
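One possible answer, as a sketch only: accumulate the squared differences four at a time, add the four partial sums, handle any leftover elements with plain C++, and take the square root at the end. The function name EuclideanDistanceSSE is hypothetical, and the unaligned loads are an assumption so that any float array can be passed.

```cpp
#include <xmmintrin.h>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: Euclidean distance between a[0..n) and b[0..n).
float EuclideanDistanceSSE(const float* a, const float* b, std::size_t n)
{
    assert(a != nullptr && b != nullptr);

    __m128 acc = _mm_setzero_ps();            // four partial sums of squares
    std::size_t i = 0;

    for (; i + 4 <= n; i += 4)
    {
        __m128 d = _mm_sub_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        acc = _mm_add_ps(acc, _mm_mul_ps(d, d));   // acc += d * d
    }

    float lanes[4];
    _mm_storeu_ps(lanes, acc);                // copy the partial sums to memory
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    for (; i < n; ++i)                        // leftover elements (at most 3)
    {
        float d = a[i] - b[i];
        sum += d * d;
    }

    return std::sqrt(sum);
}
```

Because the partial sums are added in a different order than in a plain loop, the result may differ from the scalar version in the last bits for long arrays.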
Question: How to make it more efficiently (member Shang Chieh, Chou, 15-Mar-09 16:46)
Hi:
After reading your article, I know how to write simple SSE code.
But what if my source data is unsigned short?
How can I make it more efficient?
Because after aligning the unsigned short data, it still needs the same number of loop iterations.
For example, to divide Source1 by Source2:
unsigned short Source1[40]
unsigned short Source2[40]
it needs 10 loop iterations when using SSE.

But if I can pack it into float, maybe it would need only 5 loop iterations.
Would it?
And how can I do that?
Thanks
Answer: Re: How to make it more efficiently (member Alex Fr, 16-Mar-09 9:15)
Working with integers, you need MMX or SSE2, and not SSE. This is the MMX introduction:
http://www.codeproject.com/KB/recipes/mmxintro.aspx

SSE2 has the same integer capabilities, but has larger registers.
Anyway, these technologies are used for huge arrays; there is no sense in using them for a small amount of data. Also, on modern computers SSE and MMX do not give as significant a performance boost as on the old Pentium III.
General: Faulty performance comparison (member JAF1234567890, 20-Jul-07 8:38)
The SSE C++ and inline assembly timings include two optimizations: that due to the SSE calculations and that due to a factor of 4 loop unrolling. It will be enlightening to compare a loop-unrolled native C++ timing with the other two methods.

Jeff

General: Re: Faulty performance comparison (member Andyb1979, 27-Jan-10 2:31)
Good point, I actually did this just now (a whole 3 years after your post!)
 
and the result on my PC is:
(Intel Quad-core Xeon, 1.8GHz)
 
C++ (Loop unrolled in blocks of 4): 14ms
Asm: 8ms
C++ Intrinsics: 8ms
 
Also I increased the array size to 1,000,000 to get a better time.
 
So yes, unrolling that loop does optimise the process (as C++ compiler is probably using SSE itself).
 
Now I also converted the application to process doubles not floats (as I require doubles) using SSE2 and the results are that the C++ loop unrolled now executes in 11ms vs 8ms for SSE2.
 
Hmmm. Well I'm doing something wrong.
General: Re: Faulty performance comparison (member Andyb1979, 27-Jan-10 3:45)
.... In fact, the same operation is only marginally slower (15%) in C# when using loop unrolling :O
 
I realise this SSE example is not intended for performance, just for demo purposes but it seems there is more than meets the eye when optimizing SSE code for fast execution.
General: Re: Faulty performance comparison (member Andyb1979, 27-Jan-10 6:07)
.. Wait, I'm smokin' something obviously. I didn't execute the full loop in C#
 
Used
 
for(int i = 0; i < ARRAY_SIZE/4; i++)
 
whereas I should've used
 
for(int i = 0; i < ARRAY_SIZE; i += 4)
 
Just getting it to the point where all implementations actually work (and give same result, no memory errors) and I'll post the results.
 
C# vs
C# Unrolled vs
C++ vs
C++ Unrolled
vs
C++/SIMD Intrinsics vs
SIMD Asm
General: SSE instructions!! (member minabeh, 14-Jun-07 1:37)
Hi....
is there any instruction to add 4 units of 32 bits in packed data type(__m128)???
 
Thanks..
General: A question (member lei_ma2003, 18-Apr-07 22:59)
Could SSE be used in managed C++ application?
General: Question (member shaihnc, 22-Jul-05 10:41)
I have decided to use the MSDN instructions for SSE2 programming.

I load 8 16-bit short numbers into an __m128i variable as follows:
 
_declspec(align(16)) short t1[100000];
_declspec(align(16)) short t2[100000];
__m128i temp1, temp2;
__m128i mul1,mul2;
 
temp1 = _mm_load_si128((__m128i*) ((short *) &t1[i]));
temp2 = _mm_load_si128((__m128i*) ((short *) &t2[i]));
 
Then I use the _mm_mullo_epi16 function to get the product of my variables.
 
mul1 = _mm_mullo_epi16(temp1,temp2);
 
So now I have the lower 16 bits of each 32-bit result in mul1. Now I want to be able to add these 8 16-bit short values together, or to separate them.

I cannot find any instruction which lets me do that :( :( :(

Can someone please help me?
General: Re: Question (member punkbuster, 17-Jan-06 18:55)
I know this is probably way late and you already figured it out, but did you try a union?
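Besides a union, the components can be copied to an ordinary array with a store intrinsic. The sketch below goes one step further and swaps in a different technique from the one in the question: _mm_madd_epi16 multiplies the 16-bit values and adds neighbouring products into 32-bit sums, so the upper bits of the products are not lost as they are with _mm_mullo_epi16. The function name DotProductInt16 is hypothetical, and unaligned loads are used as a simplifying assumption.

```cpp
#include <emmintrin.h>  // SSE2 integer intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: sum of a[i] * b[i] over count elements.
// count must be a multiple of 8; the total must fit in 32 bits.
std::int32_t DotProductInt16(const std::int16_t* a, const std::int16_t* b,
                             std::size_t count)
{
    assert(count % 8 == 0);

    __m128i acc = _mm_setzero_si128();        // four 32-bit partial sums

    for (std::size_t i = 0; i < count; i += 8)
    {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));

        // 8 multiplications; neighbouring products are added into 4 32-bit values
        acc = _mm_add_epi32(acc, _mm_madd_epi16(va, vb));
    }

    std::int32_t lanes[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);  // separate the 4 sums
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

The final store to lanes is the same "separate the values" step that the union performs, written with an intrinsic instead.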
General: A question (member Sachini M, 7-Jun-05 16:39)
Excellent article!
 
I'm very new to this topic and have a question. When using SSE, does the number of iterations of each loop always have to be a multiple of 4?
Lets say you need to do a check (if statement inside the loop) at every iteration, is there a way to use SSE? or is there any use using it?
 
Thanks in advance!
 
Regards,
Sachini
General: Re: A question (member Alex Fr, 8-Jun-05 2:43)
There is no restriction on the number of iterations, but every iteration works with 4 float numbers. This means that the array size should be a multiple of 4, and the number of iterations is array size/4.
General: Re: A question (member punkbuster, 16-Jan-06 18:46)
Well, actually, arrays do not need to be a multiple of 4. What you can do is for the portion that is a multiple of 4, do the SSE instructions, and with what's left over, do the regular way without SSE (which will be at max 3 iterations). This lets your optimization be dynamic across multiple array sizes.
 
So say you want to mess with an array of size 37. The first 36 you do with the SSE implementation, the last 1 you do with the normal implementation (without SSE).
 
It was a great question that wasn't addressed in the article. It's best practice to assume when creating such a function using SSE that it allows for arrays of any size.
 
---
punkbuster
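The approach described in this post can be sketched as follows, using the calculation from the SSETest project. The function name ComputeArrayAnySize is hypothetical, and unaligned loads and stores are used as a simplifying assumption so that the function accepts any float pointers.

```cpp
#include <xmmintrin.h>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical variant of the SSETest calculation that accepts any size:
// dst[i] = sqrt(src1[i]*src1[i] + src2[i]*src2[i]) + 0.5
void ComputeArrayAnySize(const float* src1, const float* src2,
                         float* dst, std::size_t size)
{
    assert(src1 != nullptr && src2 != nullptr && dst != nullptr);

    const __m128 half = _mm_set_ps1(0.5f);
    std::size_t i = 0;

    // SSE part: whole groups of 4 elements
    for (; i + 4 <= size; i += 4)
    {
        __m128 a = _mm_loadu_ps(src1 + i);
        __m128 b = _mm_loadu_ps(src2 + i);
        __m128 sum = _mm_add_ps(_mm_mul_ps(a, a), _mm_mul_ps(b, b));
        _mm_storeu_ps(dst + i, _mm_add_ps(_mm_sqrt_ps(sum), half));
    }

    // Plain C++ part: at most 3 leftover elements
    for (; i < size; ++i)
        dst[i] = std::sqrt(src1[i] * src1[i] + src2[i] * src2[i]) + 0.5f;
}
```

For an array of size 37, the first loop handles elements 0 to 35 in 9 iterations and the second loop handles element 36.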
Question: array memory alignment ? (member not_happy0, 11-May-05 20:08)

hi,
I am new to SSE, and am wondering: if the __m128 data type is "auto-aligned", why
is the result of new __m128[xx] not aligned? It seems I have to use _aligned_malloc instead?
thanks in advance
Answer: Re: array memory alignment ? (member Alex Fr, 12-May-05 3:07)
I have never tried to make an __m128[] array, so I don't know exactly whether it is aligned or not. What is the purpose of making such an array? We need __m128 variables to work with SSE registers; input and output vectors should be kept in float arrays.
General: performance loss using SSE (member David St. Hilaire, 3-Dec-04 8:18)
Thanks for the article.
 
I executed your sample apps, and there is a significant performance boost when using SSE instead of just C++.   However, the functions I've written in with SSE intrinsics have been taking 2-3 times as long to execute as their C++ counterparts.   Do you know what might cause this?
 
Below is a function I wrote to get the minimum and maximum values of an array.   This executes in roughly 80-90 microseconds on an array of 640 numbers.   The C++ function that does the same thing takes 28-31 microseconds.   What gives?   The SSE version has to do the memcpy to get the input array aligned correctly, but this only accounts for about 26 microseconds of the difference.   I realize that I'm using shorts instead of floats, but it should still work.   I converted your SSESample program to use shorts and only calculate the min and max of the input array.   The SSE code executed less than twice as fast as the C++ code after that, but it was still faster.
 
Here's the code:
<code>
void FindArrayMinMax(short *pnArray, long nCount, short &nMin, short &nMax)
{
     short *pnIn = (short*) _aligned_malloc(nCount*sizeof(short), 16);     //     16-byte aligned for SSE
     memcpy(pnIn, pnArray, nCount*sizeof(short));
     long nOutputSize = 4 + nCount%4;
     short *pnMaxOut = (short*) _aligned_malloc(nOutputSize*sizeof(short), 16);
     short *pnMinOut = (short*) _aligned_malloc(nOutputSize*sizeof(short), 16);
 
     __m64 *pmIter = (__m64*) pnIn;
     __m64 *pmMax = (__m64*) pnMaxOut;
     __m64 *pmMin = (__m64*) pnMinOut;
 
     *pmMax = *pmMin = *pmIter;     //     save first 4 values as minima and maxima
     long nLoop = nCount/4;
     for (int i=1; i<nLoop; i++)
     {
          //     get next 4 values and compare them to the saved minima and maxima
          pmIter++;
          *pmMax = _mm_max_pi16(*pmIter, *pmMax);
          *pmMin = _mm_min_pi16(*pmIter, *pmMin);
     }
 
     //     get strays, in case nCount is not a multiple of 4
     short nVal(0);
     nLoop = nCount % 4;
     for (i=1; i<=nLoop; i++)
     {
          nVal = pnIn[nCount-i];
          pnMaxOut[i+3] = nVal;
          pnMinOut[i+3] = nVal;
     }
 
     //     get max and min indices
     nMax = pnMaxOut[0];
     nMin = pnMinOut[0];
     for (i=1; i<nOutputSize; i++)
     {
          if (nMax < pnMaxOut[i])
               nMax = pnMaxOut[i];
          if (nMin > pnMinOut[i])
               nMin = pnMinOut[i];
     }
 
     //     cleanup
     _aligned_free(pnIn);
     _aligned_free(pnMinOut);
     _aligned_free(pnMaxOut);
     _mm_empty();
}
</code>
General: Re: performance loss using SSE (member Alex Farber, 3-Dec-04 8:33)
640 elements is not enough to benefit from SSE. You need to do this for very long arrays, which are used in image processing, graphics, 3D etc.
My second sample shows how to find the minimum and maximum; I don't see anything similar in your code. Does it give the right result? Instead of copying the whole array to an aligned array, you need to start from the first aligned input array member.
Anyway, you need to use MMX for these short numbers; take a look at my MMX article. On Pentium 4 you can use SSE2.
Sorry that I didn't try to understand your code; SSE programming takes a lot of time. I can try to do this, but the code must be clear, without float-short tricks.
General: Re: performance loss using SSE (member David St. Hilaire, 3-Dec-04 9:38)
Thanks for your response.   I realize that 640 is not a lot of elements, but this function is called many, many, times and it is slowing down my app.
   I do use code similar to yours to find the min and max, except that I'm using _mm_min/max_pi16 instead of _mm_min/max_ps.   It does return the correct result; I've checked it against the C++ version of the function.
   There aren't min and max functions in MMX, but I was able to get it working by using the greater than function.   Unfortunately, it takes more instructions and is a little slower than SSE.   I don't know what you mean by "float-short" tricks in my code.   There were no floats at all in the code that I posted.
   You don't have to read my code if you don't want to.   The example I posted isn't the only time I've had SSE code run slower than C++.   I just thought you or someone else might have some ideas why SSE in general would run slower than normal C++ code.
 
How do you determine which element of an array is the first aligned input array member?
 
Thanks again,
Dave
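One way to find the first aligned element is to look at the numeric value of the array address. The sketch below assumes 16-byte alignment, which is what the 128-bit aligned loads need. The helper name FirstAlignedIndex is hypothetical. Elements before the returned index would be processed with plain C++ code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: index of the first element of p[0..count) whose
// address is a multiple of 16, or count if there is no such element.
std::size_t FirstAlignedIndex(const short* p, std::size_t count)
{
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    assert(addr % sizeof(short) == 0);   // p itself must be a valid short pointer

    // bytes to skip until the next 16-byte boundary (0 if already aligned)
    std::size_t skipBytes = static_cast<std::size_t>((16 - addr % 16) % 16);
    std::size_t index = skipBytes / sizeof(short);

    return index < count ? index : count;
}
```

For a pointer that is 2 bytes past a 16-byte boundary, 14 bytes remain until the next boundary, so the function returns 7.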

General: Re: performance loss using SSE (member Alex Farber, 3-Dec-04 21:30)
Well, this is my code:
 
void FindMinMaxC(short* pnArray, int size, short& min, short& max)
{
    max = SHRT_MIN;
    min = SHRT_MAX;

    for ( int i = 0; i < size; i++ )
    {
        if ( *pnArray < min )
            min = *pnArray;

        if ( *pnArray > max )
            max = *pnArray;

        pnArray++;
    }
}

void FindMinMaxSSE(short* pnArray, int size, short& min, short& max)
{
    int i;

    union u
    {
        __m64 m;
        short n[4];
    } x;

    // max64[0, 1, 2, 3] = SHRT_MIN
    for ( i = 0; i < 4; i++ )
        x.n[i] = SHRT_MIN;

    __m64 max64 = x.m;

    // min64[0, 1, 2, 3] = SHRT_MAX
    for ( i = 0; i < 4; i++ )
        x.n[i] = SHRT_MAX;

    __m64 min64 = x.m;

    __m64* pSource = (__m64*) pnArray;

    // process 4 short values per iteration
    for ( i = 0; i < size/4; i++ )
    {
        min64 = _mm_min_pi16(*pSource, min64);
        max64 = _mm_max_pi16(*pSource, max64);

        pSource++;
    }

    // extract minimum and maximum values from min64 and max64
    x.m = min64;
    min = min(x.n[0],
              min(x.n[1],
                  min(x.n[2],
                      x.n[3])));

    x.m = max64;
    max = max(x.n[0],
              max(x.n[1],
                  max(x.n[2],
                      x.n[3])));

    _mm_empty();    // clear the MMX state before any floating-point code runs
}
 
I don't care about alignment and array size in the FindMinMaxSSE function, assuming that client does this.
Test results for 1000000 members:
C++ 20 ms
SSE 7 ms
 
Testing for 10000 members I get 0 in both cases.
 
Tests must be done in Release configuration. Again, there is no need to use SSE for small arrays. It doesn't matter that you call function many times. Array must be very long to get performance boost from SSE. In your case, use C++ code.
General: AMD support (member Jens froslev-nielsen, 1-Dec-04 1:20)
Thanks for 2 well-written articles (sseintro & mmxintro).
Now I wonder: do you, or perhaps anybody here, know how to use the 3DNow technology in the same manner as shown here?

General: q: movaps vs. movups (member yoaz, 4-Nov-04 8:31)
sorry to bother u again with beginner's questions, but i'm quite stuck.
I have a class using SSE. I'm declaring a member private variable:
__declspec(align(16))unsigned char m_nodes[ARRAY_SIZE];
later on i try to use it in an asm block,
movaps	xmm0, [esi]
with esi pointing to the array base address. This however throws an exception, which is because the array is not aligned (the base address should be a multiple of 16, am i right?).
I can't figure it out. why isn't my array aligned?
another, final, question: do you know, or can u point me to the actual performance difference between movaps and movups
 
thanks
 
there are no facts, only interpretations
General: Re: q: movaps vs. movups (member Alex Farber, 5-Nov-04 2:35)
1) What is ARRAY_SIZE value? Why variable type is unsigned char and not float? What exception exactly do you have?
2) Take a look at Assembly code generated by C++ compiler from movaps and movups.
General: Re: q: movaps vs. movups (member yoaz, 5-Nov-04 2:54)
Alex Farber wrote:
What is ARRAY_SIZE value?
it's an int, value=16
 
Alex Farber wrote:
Why variable type is unsigned char and not float?
I want to use SSE2 for SIMD operations on 16 bytes
 
Alex Farber wrote:
What exception exactly do you have?
SEHException. But it occurs with movaps and not with movups.
 
Alex Farber wrote:
Take a look at Assembly code generated by C++ compiler from movaps and movups.
i was hoping to generate the Assembly code myself (working with inline Assembly), but i'll debug again.
 
Thanks a lot for the suggestions. I've managed to work around this, by using _aligned_malloc, though I have no idea why this aligns member variables, and __declspec(align(16)) doesn't. Any ideas?
 
thanks again,
 
there are no facts, only interpretations
General: Excelent! + a question (member yoaz, 19-Sep-04 23:52)
A really interesting and enlightening article. I have a small question: as I understand, MMX uses the mm0-mm7 registers, which are actually CPU floating point registers, whereas SSE/2 uses the xmm0-xmm7 registers, which were especially defined for SIMD purposes. And finally, the question(s) -
  1. Does this mean that I can use both types of registers simultaneously?
  2. Does this mean that I can do without the EMMS instruction when writing pure SSE/2 code?
thanks, I really enjoyed this article
 
there are no facts, only interpretations
General: Re: Excelent! + a question (member Alex Farber, 20-Sep-04 0:39)
AFAIK, EMMS instruction must be used only with MMX:
 
The EMMS instruction must be used to clear the MMX™ technology state at the end of all MMX™ technology routines and before calling other procedures or subroutines that may execute floating-point instructions. If a floating-point instruction loads one of the registers in the FPU register stack before the FPU tag word has been reset by the EMMS instruction, a floating-point stack overflow can occur that will result in a floating-point exception or incorrect result.
 
SSE doesn't require this instruction.
I don't have experience in using SSE2.
General: Re: Excelent! + a question (member yoaz, 20-Sep-04 1:54)
thanks :-D
 
there are no facts, only interpretations
Question: Is this a VC 2003 compiler BUG ? (member leandrobecker, 10-Jun-04 4:51)
Hi
 
I´m trying to measure some codes (beginning to SSE) and when compiling the below code in Release (optimized for speed) in VC++ 2003 the optimizer makes some weird things (put a breakpoint at the start of the main and you will see).

// SSE.cpp : Defines the entry point for the console application.
//
 
#include "stdafx.h"
#include
 
// *** BEGIN OF INCLUDE SECTION 1
// *** INCLUDE THE FOLLOWING DEFINE STATEMENTS FOR MSVC++ 5.0
 
//#define CPUID __asm __emit 0fh __asm __emit 0a2h
//#define RDTSC __asm __emit 0fh __asm __emit 031h
 
// *** END OF INCLUDE SECTION 1
 
#define SIZE 1
 
// *** BEGIN OF INCLUDE SECTION 2
// *** INCLUDE THE FOLLOWING FUNCTION DECLARATION AND CORRESPONDING
// *** FUNCTION (GIVEN BELOW)
unsigned FindBase();
// *** END OF INCLUDE SECTION 2
 
int _tmain(int argc, _TCHAR* argv[])
{
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
__m128 Data1, Data2, Res, Data3;
Res.m128_f32[0] = 0.0f;
Res.m128_f32[1] = 0.0f;
Res.m128_f32[2] = 0.0f;
Res.m128_f32[3] = 0.0f;
float Values1[] = { 1.0f, 2.0f, 3.0f, 4.0f };
float Values2[] = { 1.0f, 2.0f, 3.0f, 4.0f };
float Results[] = { 0.0f, 0.0f, 0.0f, 0.0f };
int i;
 
// *** BEGIN OF INCLUDE SECTION 3
// *** INCLUDE THE FOLLOWING DECLARATIONS IN YOUR CODE
// *** IMMEDIATELY AFTER YOUR DECLARATION SECTION.
unsigned base=0, iterations=0, sum=0;
unsigned cycles_high1=0, cycles_low1=0;
unsigned cycles_high2=0, cycles_low2=0;
unsigned __int64 temp_cycles1=0, temp_cycles2=0;
__int64 total_cycles=0; // Stored signed so it can be converted
// to a double for viewing
double seconds=0.0L;
unsigned mhz=2000; // If you want a seconds count instead
// of just cycles, enter the MHz of your
// machine in this variable.
base=FindBase();
// *** END OF INCLUDE SECTION 3
 
for (i=0; i
General: Intel compiler (member Lars Schouw, 18-Apr-04 21:06)
Did you try this code out with the Intel C++ 8 compiler as well?
It yields much better performance than the 10% you will get with VC 7.1, more like 350% faster!!
 
Regards
Lars Schouw

General: Re: Intel compiler (member Alex Farber, 18-Apr-04 21:21)
The only Intel program I have worked with is IPL (Image Processing Library). It is so good that I believe you.
Question: How about double data type? (suss mrskyok, 16-Feb-04 0:28)
:confused: Can we use SSE to handle double type data? Thanks!
Answer: Re: How about double data type? (member Alex Farber, 16-Feb-04 0:35)
AFAIK, we cannot do this.
General: Re: How about double data type? (member nutty, 9-Feb-05 3:56)
What I can understand from the VC 2003 manual is that for most of the intrinsics used here, there is an equivalent for double data types.
_mm_mul_pd vs. _mm_mul_ps
 
and there is a __m128d also.
 
I tried it out, I included emmintrin.h, but I had to search for it in the include folder from VS :-(
 
I did a lot of replacements in SSETestDlg class:
 
float -> double
int nLoop = nSize/ 4; -> int nLoop = nSize/ 2;
_ps -> _pd ( also in the assembly part )
__m128 -> __m128d
 
and what I understand is that I am using SSE2 rather than SSE now
 
To my surprise it didn't crash, but gave correct results.
But, unfortunately, it wasn't significantly faster than C++ anymore :-(
 
can anyone tell why??
 

 


Answer: Re: How about double data type? (member doug65536, 11-Aug-08 0:50)
Double-precision floating point requires SSE2. If you run on capable hardware, it works mostly the same as SSE, except:
 
  • Doubles are twice as big, so half as many values fit in each register.
  • Where single-precision intrinsics typically end with _ss or _ps (scalar-single or packed-single), double-precision arithmetic intrinsics end with _sd or _pd (scalar-double or packed-double).
  • You will find that the compiler often wants __m128d data types, which represent double-precision vectors.
  • You'll probably need to #include <emmintrin.h>, where the SSE2 intrinsics are declared
 
Take a good look at intrin.h and xmmintrin.h to get a good idea of the operations available.
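To make the differences listed above concrete, here is a minimal SSE2 sketch of a square-root loop over doubles. The function name SqrtArraySSE2 is hypothetical, and unaligned loads and stores are used as a simplifying assumption. The SSE2 intrinsics and the __m128d type come from emmintrin.h.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics and the __m128d type
#include <cassert>
#include <cstddef>

// Hypothetical helper: replaces every element of data[0..count) with its
// square root, two doubles per iteration. count must be a multiple of 2.
void SqrtArraySSE2(double* data, std::size_t count)
{
    assert(count % 2 == 0);

    for (std::size_t i = 0; i < count; i += 2)
    {
        __m128d v = _mm_loadu_pd(data + i);   // load 2 doubles
        v = _mm_sqrt_pd(v);                   // 2 square roots in one operation
        _mm_storeu_pd(data + i, v);           // write the 2 results back
    }
}
```

Each iteration handles only 2 values instead of 4, which is one reason the gain over plain C++ code tends to be smaller for doubles than for floats.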
General: Very good! (member Vincent Leong77, 3-Aug-03 20:02)
Alex, if you had published this article 3 months ago, it would have eased my headache over processor optimization.
Good article.
 
Crystal Silver Codes
vleong@first.net.my
General: how to use SSE under Linux? (member EagleCalifornia, 9-Sep-03 1:43)
Does anyone know how to port the prewritten code to icc on Linux?
General: Re: how to use SSE under Linux? (member Alex Farber, 9-Sep-03 2:18)
According to this document:
http://www.tacc.utexas.edu/resources/user_guides/intel/c_ug_lnx.pdf
SSE is supported exactly as in VC 7.1. However, I suggest you ask this question in some non-Visual C++ forum, where Linux and Unix programmers can help you. For example:
http://www.codeguru.com/forum/forumdisplay.php?s=&forumid=9

General: Re: how to use SSE under Linux? (suss gnuLNX, 23-Sep-03 3:59)
Not sure about using the icc, but for gcc you can compile the same code found here by using the -msse switch.
General: Re: how to use SSE under Linux? (suss Christophe Avoinne, 18-Oct-03 2:39)
Wrong, you do not have __m128 in GNU, and the intrinsic functions are different
 
to have it :
 
typedef float __m128 __attribute__( ( mode( V4SF ), aligned( 16 ) ) ); // supposedly __m128 must be aligned
 

to add two __m128 variables a and b :
 

__m128 c = __builtin_ia32_addps( a, b );
 
As you can see, you must prefix the SSE instruction name with "__builtin_ia32_" to use it, which is definitely not compatible with the MS intrinsic functions.

By the way, y = _mm_set_ps1( x ) is "y = __builtin_ia32_loadss( &x ); y = __builtin_ia32_shufps( y, y, 0 );"
 
/chris
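For comparison, the same load-then-shuffle broadcast can be sketched with the intrinsic names from xmmintrin.h, assuming a compiler that ships that header. The helper names broadcast and lane are made up for this example.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: copies x into all four lanes, spelled out as a
// scalar load followed by a shuffle (what _mm_set_ps1 does in one call).
__m128 broadcast(float x)
{
    __m128 y = _mm_load_ss(&x);                            // y = { x, 0, 0, 0 }
    return _mm_shuffle_ps(y, y, _MM_SHUFFLE(0, 0, 0, 0));  // lane 0 -> every lane
}

// Hypothetical helper: reads lane i of v by storing it to a temporary array.
float lane(__m128 v, int i)
{
    assert(i >= 0 && i < 4);
    float tmp[4];
    _mm_storeu_ps(tmp, v);
    return tmp[i];
}
```

Calling lane( broadcast( 2.5f ), i ) for any i from 0 to 3 should give the same value as lane( _mm_set_ps1( 2.5f ), i ).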
 

GeneralRe: how to use SSE under Linux?sussgnuLNX20-Oct-03 1:53 
Actually I do have an __m128 data type for floating point operations. I am using gcc 3.3.1. I have learned a couple of tricks since my last post here that others might find useful. I am typing this code from memory, so it may not be perfect. The basic idea is to use a union to hold both the __m128 data type and a float[4] array. Remember that all members of a union occupy the same memory address, and the type is switched depending on which member you use.
 
union myDataType {
__m128 vec;
float arr[4] __attribute__(( aligned( 16 ) ));
};
 
Now the data can be stored interchangeably in both. For a variable x of this type, use x.vec to access the __m128 member and x.arr to access the array.
 
Good luck to everyone out there. BTW take what I say with a grain of salt. I claim to be no guru just someone who loves to code and figure stuff out.
 
Thanks for the great discussion and great site.
 
-gnuLNX
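A minimal sketch of the union idea, written against xmmintrin.h. The names Vec4 and sum_of_lanes are invented for this example. Reading a union member other than the one last written is a compiler extension (GCC documents it as supported) rather than something ISO C++ guarantees, so treat this as an assumption about the compiler.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical union: the same 16 bytes viewed as one SSE vector or 4 floats.
union Vec4
{
    __m128 vec;
    float  arr[4];
};

// Hypothetical helper: adds two 4-float arrays with one addps, then reads
// the lanes back through the float view and sums them.
float sum_of_lanes(const float* a, const float* b)
{
    assert(a && b);
    Vec4 r;
    r.vec = _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));  // write via the vector view
    return r.arr[0] + r.arr[1] + r.arr[2] + r.arr[3];      // read via the array view
}
```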
GeneralRe: how to use SSE under Linux?memberPSuade20-Oct-03 8:40 
Hmm... I'm using GCC 3.2.3 but I don't have __m128. Are you sure you don't need to include a header file?

I strongly advise against using a union, because it turns off some optimizations (I tried a lot of tricks with the SSE builtins).

For those who are interested in real SSE optimization in C++:
 
////////////////////////////////////////////////////////////////////////////////
 
#define always_inline inline __attribute__( ( always_inline ) )
 
////////////////////////////////////////////////////////////////////////////////
 
typedef float __v4sf __attribute__( ( mode( V4SF ), aligned( 16 ) ) );
 
////////////////////////////////////////////////////////////////////////////////
 
struct v4sf
{
__v4sf v;
 
///
always_inline
v4sf( ) { }
 
always_inline
v4sf( __v4sf _1 ) : v( _1 ) { }

always_inline
operator __v4sf( ) const { return v; }
};
 
always_inline
v4sf operator +( v4sf _1, v4sf _2 )
{ return __builtin_ia32_addps( _1.v, _2.v ); }
 
always_inline
v4sf operator +( __v4sf _1, v4sf _2 )
{ return __builtin_ia32_addps( _1, _2.v ); }
 
always_inline
v4sf operator +( v4sf _1, __v4sf _2 )
{ return __builtin_ia32_addps( _1.v, _2 ); }
 
always_inline
v4sf operator -( v4sf _1, v4sf _2 )
{ return __builtin_ia32_subps( _1.v, _2.v ); }
 
always_inline
v4sf operator -( __v4sf _1, v4sf _2 )
{ return __builtin_ia32_subps( _1, _2.v ); }
 
always_inline
v4sf operator -( v4sf _1, __v4sf _2 )
{ return __builtin_ia32_subps( _1.v, _2 ); }
 
always_inline
v4sf operator *( v4sf _1, v4sf _2 )
{ return __builtin_ia32_mulps( _1.v, _2.v ); }
 
always_inline
v4sf operator *( __v4sf _1, v4sf _2 )
{ return __builtin_ia32_mulps( _1, _2.v ); }
 
always_inline
v4sf operator *( v4sf _1, __v4sf _2 )
{ return __builtin_ia32_mulps( _1.v, _2 ); }
 
always_inline
v4sf operator /( v4sf _1, v4sf _2 )
{ return __builtin_ia32_divps( _1.v, _2.v ); }
 
always_inline
v4sf operator /( __v4sf _1, v4sf _2 )
{ return __builtin_ia32_divps( _1, _2.v ); }
 
always_inline
v4sf operator /( v4sf _1, __v4sf _2 )
{ return __builtin_ia32_divps( _1.v, _2 ); }
 
////////////////////////////////////////////////////////////////////////////////
 
Now using "struct v4sf" helps the compiler allocate SSE registers without putting v on the stack. Using a union would prevent the compiler from doing register optimizations and would put v on the stack even when an SSE register is more appropriate.
 
v4sf a,b,d;
void f()
{
d = a * ( d + b );
}
 
that gives us :
 

65: 0f 28 3d 20 00 00 00 movaps 0x20,%xmm7
6c: 0f 28 35 00 00 00 00 movaps 0x0,%xmm6
73: 0f 58 3d 10 00 00 00 addps 0x10,%xmm7
7a: 0f 59 f7 mulps %xmm7,%xmm6
7d: 0f 29 35 20 00 00 00 movaps %xmm6,0x20
 
Now if we replace :
 
struct v4sf
{
union { __v4sf v; float f[4]; };
 
...
 
that gives us (what ugly code!):
 
65: a1 00 00 00 00 mov 0x0,%eax
6a: 89 45 d8 mov %eax,0xffffffd8(%ebp)
6d: a1 04 00 00 00 mov 0x4,%eax
72: 89 45 dc mov %eax,0xffffffdc(%ebp)
75: a1 08 00 00 00 mov 0x8,%eax
7a: 89 45 e0 mov %eax,0xffffffe0(%ebp)
7d: a1 0c 00 00 00 mov 0xc,%eax
82: 89 45 e4 mov %eax,0xffffffe4(%ebp)
85: 0f 28 75 d8 movaps 0xffffffd8(%ebp),%xmm6
89: a1 20 00 00 00 mov 0x20,%eax
8e: 89 45 b8 mov %eax,0xffffffb8(%ebp)
91: a1 24 00 00 00 mov 0x24,%eax
96: 89 45 bc mov %eax,0xffffffbc(%ebp)
99: a1 28 00 00 00 mov 0x28,%eax
9e: 89 45 c0 mov %eax,0xffffffc0(%ebp)
a1: a1 2c 00 00 00 mov 0x2c,%eax
a6: 89 45 c4 mov %eax,0xffffffc4(%ebp)
a9: 0f 28 7d b8 movaps 0xffffffb8(%ebp),%xmm7
ad: a1 10 00 00 00 mov 0x10,%eax
b2: 89 45 a8 mov %eax,0xffffffa8(%ebp)
b5: a1 14 00 00 00 mov 0x14,%eax
ba: 89 45 ac mov %eax,0xffffffac(%ebp)
bd: a1 18 00 00 00 mov 0x18,%eax
c2: 89 45 b0 mov %eax,0xffffffb0(%ebp)
c5: a1 1c 00 00 00 mov 0x1c,%eax
ca: 89 45 b4 mov %eax,0xffffffb4(%ebp)
cd: 0f 58 7d a8 addps 0xffffffa8(%ebp),%xmm7
d1: 0f 29 7d c8 movaps %xmm7,0xffffffc8(%ebp)
d5: 0f 59 75 c8 mulps 0xffffffc8(%ebp),%xmm6
d9: 0f 29 75 e8 movaps %xmm6,0xffffffe8(%ebp)
dd: 8b 45 e8 mov 0xffffffe8(%ebp),%eax
e0: a3 20 00 00 00 mov %eax,0x20
e5: 8b 45 ec mov 0xffffffec(%ebp),%eax
e8: a3 24 00 00 00 mov %eax,0x24
ed: 8b 45 f0 mov 0xfffffff0(%ebp),%eax
f0: a3 28 00 00 00 mov %eax,0x28
f5: 8b 45 f4 mov 0xfffffff4(%ebp),%eax
f8: a3 2c 00 00 00 mov %eax,0x2c
 
So you shouldn't mix a vector and a float array in the same union like that.
 
even :
struct v4sf
{
union { __v4sf v; float f[4] __attribute( ( aligned( 16 ) ) ); };
 
...
 
or :
 
struct v4sf
{
union { __v4sf v; float f[4]; } __attribute( ( aligned( 16 ) ) );
 
...
 
doesn't change anything.

To access the 4 floats, just create another class float4 with conversion operators between v4sf and float4.
 
Oh yeah, flags were :
-march=athlon-xp
-fomit-frame-pointer
-mfpmath=sse
-O6
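For readers whose compiler provides xmmintrin.h, a similar wrapper can be sketched with __m128 and the _mm_ intrinsics instead of the __builtin_ia32_ names. This is only an illustration under that assumption: the names vec4 and scale_sum are hypothetical, and the generated code has not been compared with the listings above.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical wrapper: holds only the vector (no float array inside).
struct vec4
{
    __m128 v;

    vec4() : v(_mm_setzero_ps()) {}
    vec4(__m128 x) : v(x) {}
    explicit vec4(const float* p) : v(_mm_loadu_ps(p)) {}  // unaligned load

    void store(float* p) const { _mm_storeu_ps(p, v); }    // unaligned store
};

inline vec4 operator+(vec4 a, vec4 b) { return _mm_add_ps(a.v, b.v); }
inline vec4 operator-(vec4 a, vec4 b) { return _mm_sub_ps(a.v, b.v); }
inline vec4 operator*(vec4 a, vec4 b) { return _mm_mul_ps(a.v, b.v); }
inline vec4 operator/(vec4 a, vec4 b) { return _mm_div_ps(a.v, b.v); }

// Hypothetical helper: d = a * ( d + b ), the expression used in f() above,
// applied to three arrays of 4 floats.
void scale_sum(const float* a, const float* b, float* d)
{
    assert(a && b && d);
    vec4 r = vec4(a) * (vec4(d) + vec4(b));
    r.store(d);
}
```

The wrapper deliberately has no conversion operator back to __m128, to avoid any ambiguity with vector operators some compilers define natively; the raw vector is reachable through the v member.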

GeneralRe: how to use SSE under Linux?sussgnuLNX20-Oct-03 10:57 
First off let me say that I enjoyed looking over your code...some nice things you have done. I have a couple of answers for you as well as a couple of questions to help me better understand what is going on.
 
You are totally correct: the 3.2 tree does not have the __m128 intrinsic. However, the 3.3.1 tree does. Here is a small piece from the xmmintrin.h header file. I switched to 3.3.1 for this very reason. Also, the only reason I ever used a union was so that I could easily read and write the xmm data to main memory. May or may not be a good idea... I still want to do some benchmarks against your very cool code!
 
static __inline __m128
_mm_add_ss (__m128 __A, __m128 __B)
{
return (__m128) __builtin_ia32_addss ((__v4sf)__A, (__v4sf)__B);
}
 
static __inline __m128
_mm_sub_ss (__m128 __A, __m128 __B)
{
return (__m128) __builtin_ia32_subss ((__v4sf)__A, (__v4sf)__B);
}
 
static __inline __m128
_mm_mul_ss (__m128 __A, __m128 __B)
{
return (__m128) __builtin_ia32_mulss ((__v4sf)__A, (__v4sf)__B);
}
 
static __inline __m128
_mm_div_ss (__m128 __A, __m128 __B)
{
return (__m128) __builtin_ia32_divss ((__v4sf)__A, (__v4sf)__B);
}
 
static __inline __m128
_mm_sqrt_ss (__m128 __A)
{
return (__m128) __builtin_ia32_sqrtss ((__v4sf)__A);
}
 
static __inline __m128
_mm_rcp_ss (__m128 __A)
{
return (__m128) __builtin_ia32_rcpss ((__v4sf)__A);
}
 

Now down to the business of structures vs. unions.
I assume that you used the structure to wrap your __v4sf datatype, correct? Since this datatype "is" included in 3.3.1, do we still need to take your route? Again, I am certainly no guru, but I am definitely getting some decent speed gains in my code. Now back to something I saw in your assembly dump that caught my attention. Your code seems to make use of more registers than just xmm0, but when I compile your code in the same manner, omitting the athlon switch, I still only use register xmm0. In fact, no matter how I compile it, I still only use one xmm register. Do you have any thoughts on this?

Also, since the 3.3.1 tree does have the __m128 data type, would you be willing to write a similar structure overloading * and + and such?

Thanks again for your valuable info, and sorry up front if I sound like I have no clue... probably because I really don't!
 

GeneralRe: how to use SSE under Linux?memberPSuade21-Oct-03 9:40 
I'm using the latest release of Dev-C++ with GCC 3.2.3, because I dislike VC 6.0/7.0, which are not compliant with ISO C++, and as a C++ guru I like to use templates in unusual ways. So compatibility with xmmintrin.h is not something I care about. Formerly I was an assembly coder, and I find GCC's inline asm statements to be the best I have ever seen: you can let the compiler choose which registers to allocate, in such a way that global optimizations can still happen. That is not something you can really do with VC 6.0/7.0. But I admit that having xmmintrin.h helps us reuse existing code written against it.

My warning against using a union comes from this: you mustn't have an array of float in a struct if you want your compiler to keep your vectors of floats in registers instead of memory slots on the stack, because an array means memory slots.

Why use a struct? Well, it is in fact the only way to use operators with these vectors. But I must admit that the code generated that way is not always very good, especially with complex expressions.
 
I tried the same code with -march=pentium3 (minimal for having SSE) instead of -march=athlon-xp and found the same result.
 
Here are the flags I used:
-march=pentium3 (SSE only) / -march=pentium4 (SSE,SSE2) / -march=athlon-xp (SSE, 3Dnow!, Ext3DNow! )
-fomit-frame-pointer
-mfpmath=sse
-O6
-fssa
-fssa-dce
-fssa-ccp
-fprefetch-loop-arrays
 
otherwise you may need to add -msse ( I don't need it apparently )
 
How to feed a struct v4sf with an array of float ?
 
////////////////////////////////////////////////////////////////////////////////
 
typedef float __v4sf __attribute__( ( mode( V4SF ), aligned( 16 ) ) );
 
////////////////////////////////////////////////////////////////////////////////
 
struct v4sf
{
__v4sf v;
 
///
always_inline
v4sf( ) { }
 
always_inline
v4sf( __v4sf _1 )
: v( _1 ) { }
 
always_inline
v4sf( float const *_1 )
: v( __builtin_ia32_loadups( ( float * )_1 ) ) { }
 
always_inline
operator __v4sf( ) const { return v; }
 
};
 
struct float4
{
union { float v[4]; };

///
always_inline
float4( ) { }
 
always_inline
float4( __v4sf _1 )
{ __builtin_ia32_storeups( v, _1 ); }
 
always_inline
operator __v4sf( ) const { return v4sf( v ); }
};
 
...
 
float const f[4] = { 1.0, -1.0, 1.0, -1.0 };
 
__v4sf compute( __v4sf a, __v4sf b, __v4sf c )
{
return a * ( v4sf( b ) + c ) - f;
}
 
gives us :
 
000000a0 <__Z7computeU8__vectorfS_S_>:
a0: 0f 28 54 24 14 movaps 0x14(%esp,1),%xmm2
a5: 0f 28 44 24 04 movaps 0x4(%esp,1),%xmm0
aa: 0f 58 54 24 24 addps 0x24(%esp,1),%xmm2
af: 0f 59 c2 mulps %xmm2,%xmm0
b2: 0f 10 15 04 01 00 00 movups 0x104,%xmm2
b9: 0f 5c c2 subps %xmm2,%xmm0
bc: c3 ret
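The same "feed the vector from an array of floats" step, sketched with the xmmintrin.h names for an array of arbitrary length. The function name add_arrays is hypothetical. The unaligned load/store intrinsics are used so that the pointers need not be 16-byte aligned, and a scalar loop handles whatever is left when n is not a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* out, std::size_t n)
{
    assert(a && b && out);
    std::size_t i = 0;

    // Main loop: 4 floats per iteration.
    for (; i + 4 <= n; i += 4)
    {
        __m128 va = _mm_loadu_ps(a + i);             // like __builtin_ia32_loadups
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  // like __builtin_ia32_storeups
    }

    // Scalar tail: the remaining 0 to 3 elements.
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```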
 
My mail address is paul.suade@laposte.net.

GeneralRe: how to use SSE under Linux?sussgnuLNX22-Oct-03 2:25 
Hey Paul, thanks a lot. You have given me a lot to think about. I should mention that I am using gcc on Linux and not on Windows... BTW, does it integrate well with Dev-C++? I think I am going to be doing some Windows coding in the near future.
GeneralRe: how to use SSE under Linux?memberPSuade22-Oct-03 4:22 
Dev-C++ is not so bad. It is an IDE for MINGW32 GCC compiler but you can indeed change the gcc compiler if CYGWIN is needed.
 
I think it could be an ideal IDE for using GCC in both WIN32 and linux platforms.
 
But I don't think Dev-C++ can replace all the features of VC IDE.
 
Happy coding.

P.S.: I got the xmmintrin.h/emmintrin.h files from CVS. But I'm still uncertain whether I should use them.
GeneralSSE2 Examples...membergodot_gildor31-Jul-03 6:52 
Excellent article and excellent examples. Now I'm looking for an example project using SSE2 intrinsics. I noticed that the Swarm project from MS says that it uses both MMX and SSE2, but when I download the project, it only contains the MMX code. Do you know of any example code for SSE2 that I can look over?
 
-Brett

GeneralRe: SSE2 Examples...memberAlex Farber2-Aug-03 20:06 
As I remember, SSE2 extends both the MMX and SSE technologies. It allows working with double-precision numbers and has a set of MMX-like instructions that use the 128-bit SSE registers. The MMXSwarm sample contains such MMX-like SSE2 instructions. I believe you can find more information by searching for SSE2 with Google; this is one good link, for example:
http://www.intel.com/update/departments/software/sw03011.pdf
Vincent Leong from CodeGuru published an article about MMX:
http://www.codeguru.com/cpp_mfc/MMXDemo.html
and promised some SSE stuff in his next article.
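To give a rough idea of the MMX-like integer side of SSE2, here is a small sketch that adds two blocks of 16 unsigned bytes with saturation in one instruction. The function name add_saturate_u8 is made up for the example; __m128i, _mm_loadu_si128, _mm_adds_epu8 and _mm_storeu_si128 come from emmintrin.h.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics, including the integer ones

// Hypothetical helper: out[i] = min(a[i] + b[i], 255) for 16 bytes at once.
void add_saturate_u8(const unsigned char* a, const unsigned char* b,
                     unsigned char* out)
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i sum = _mm_adds_epu8(va, vb);  // 16 saturating byte additions
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), sum);
}
```

The MMX version of the same operation works on 8 bytes per instruction; the SSE2 form does 16 and leaves the x87/MMX registers alone.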

Last Updated 11 Jul 2003
Article Copyright 2003 by Alex Fr
Everything else Copyright © CodeProject, 1999-2013