 |
 | How to make it more efficiently Shang Chieh, Chou | 17:46 15 Mar '09 |
|
 |
Hi: After reading your article, I know how to write simple SSE code. But my source data is unsigned short. How can I make it more efficient? Even after aligning the unsigned short data, it still needs the same number of loop iterations. For example, dividing Source1 by Source2 takes 10 loop iterations when using SSE: unsigned short Source1[40]; unsigned short Source2[40];
But if I can pack it into float, maybe it would need only 5 loop iterations, wouldn't it? And how do I do that? Thanks.
|
|
|
|
 |
|
 |
To work with integers, you need MMX or SSE2, not SSE. Here is an MMX introduction: http://www.codeproject.com/KB/recipes/mmxintro.aspx
SSE2 offers the same integer operations but with larger registers. In any case, these technologies are meant for huge arrays; there is no sense in using them for small amounts of data. Also, on modern computers SSE and MMX do not give as significant a performance boost as they did on the old Pentium III.
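For the division case in the question, here is a minimal sketch, assuming SSE2 is available. SSE2 has no packed integer division, so the sketch widens eight unsigned shorts to 32 bits, converts them to float, divides, and truncates back. The names `divide_u16` and `div4_via_float` are hypothetical, and the code assumes that no divisor is zero. With 40 elements the vector loop runs 5 times.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Divide four 32-bit lanes (each holding a value in 0..65535) through float.
static inline __m128i div4_via_float(__m128i num, __m128i den) {
    __m128 q = _mm_div_ps(_mm_cvtepi32_ps(num), _mm_cvtepi32_ps(den));
    return _mm_cvttps_epi32(q);  // truncate toward zero, like integer division
}

// out[i] = a[i] / b[i]; assumption: every b[i] is non-zero.
void divide_u16(const std::uint16_t* a, const std::uint16_t* b,
                std::uint16_t* out, std::size_t n) {
    const __m128i zero = _mm_setzero_si128();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // Interleaving with zero widens unsigned 16-bit lanes to 32 bits.
        __m128i qlo = div4_via_float(_mm_unpacklo_epi16(va, zero),
                                     _mm_unpacklo_epi16(vb, zero));
        __m128i qhi = div4_via_float(_mm_unpackhi_epi16(va, zero),
                                     _mm_unpackhi_epi16(vb, zero));
        // Sign-extend the low 16 bits so the signed pack below cannot saturate.
        qlo = _mm_srai_epi32(_mm_slli_epi32(qlo, 16), 16);
        qhi = _mm_srai_epi32(_mm_slli_epi32(qhi, 16), 16);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i),
                         _mm_packs_epi32(qlo, qhi));
    }
    // Scalar tail for the last n % 8 elements.
    for (; i < n; ++i) out[i] = static_cast<std::uint16_t>(a[i] / b[i]);
}
```

Whether this beats a plain loop depends on the array size and the compiler, so it is worth timing both.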
|
|
|
|
 |
 | Faulty performance comparison JAF1234567890 | 9:38 20 Jul '07 |
|
 |
The SSE C++ and inline assembly timings include two optimizations: one due to the SSE calculations and one due to unrolling the loop by a factor of 4. It would be enlightening to compare a loop-unrolled native C++ timing with the other two methods.
Jeff
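To make the comparison concrete, here is a rough sketch of the two variants side by side. The operation (element-wise multiply) and the names `mul_unrolled4` and `mul_sse` are assumptions for illustration only, not the article's benchmark code.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Plain C++ unrolled by 4. No SIMD is requested explicitly,
// although an optimizing compiler may still vectorize it.
void mul_unrolled4(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        c[i]     = a[i]     * b[i];
        c[i + 1] = a[i + 1] * b[i + 1];
        c[i + 2] = a[i + 2] * b[i + 2];
        c[i + 3] = a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i) c[i] = a[i] * b[i];  // leftover elements
}

// The same work with SSE intrinsics: one multiply handles four floats.
void mul_sse(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_mul_ps(va, vb));
    }
    for (; i < n; ++i) c[i] = a[i] * b[i];  // leftover elements
}
```

Timing both in a Release build separates the gain from unrolling from the gain from SIMD.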
|
|
|
|
 |
|
 |
Good point, I actually did this just now (a whole 3 years after your post!)
and the result on my PC is: (Intel Quad-core Xeon, 1.8GHz)
C++ (loop unrolled in blocks of 4): 14 ms; Asm: 8 ms; C++ intrinsics: 8 ms
Also I increased the array size to 1,000,000 to get a better time.
So yes, unrolling that loop does optimise the process (as C++ compiler is probably using SSE itself).
Now I also converted the application to process doubles not floats (as I require doubles) using SSE2 and the results are that the C++ loop unrolled now executes in 11ms vs 8ms for SSE2.
Hmmm. Well I'm doing something wrong.
|
|
|
|
 |
|
 |
.... In fact, the same operation is only marginally slower (15%) in C# when using loop unrolling
I realise this SSE example is not intended for performance, just for demo purposes but it seems there is more than meets the eye when optimizing SSE code for fast execution.
|
|
|
|
 |
|
 |
.. Wait, I'm smokin' something, obviously. I didn't execute the full loop in C#.
Used
for(int i = 0; i < ARRAY_SIZE/4; i++)
whereas I should've used
for(int i = 0; i < ARRAY_SIZE; i += 4)
Just getting it to the point where all implementations actually work (and give same result, no memory errors) and I'll post the results.
C# vs C# Unrolled vs C++ vs C++ Unrolled vs C++/SIMD Intrinsics vs SIMD Asm
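A small sketch of why the first loop header falls short, written in C++ and assuming the unrolled body touches elements `i` to `i + 3`. The helper `count_touched` is hypothetical; it only counts which elements a given header reaches.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Counts the distinct elements reached by a loop of the form
//   for (i = 0; i < limit; i += step) { use elements i .. i+3 }
std::size_t count_touched(std::size_t n, std::size_t limit, std::size_t step) {
    std::vector<bool> seen(n, false);
    for (std::size_t i = 0; i < limit; i += step)
        for (std::size_t k = 0; k < 4 && i + k < n; ++k)
            seen[i + k] = true;
    return static_cast<std::size_t>(std::count(seen.begin(), seen.end(), true));
}
```

For a 16-element array, `count_touched(16, 16 / 4, 1)` reaches only the first 7 elements, while `count_touched(16, 16, 4)` reaches all 16.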
|
|
|
|
 |
 | SSE instructions!! minabeh | 2:37 14 Jun '07 |
|
 |
Hi.... is there any instruction to add 4 units of 32 bits in packed data type(__m128)???
Thanks..
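The question can be read two ways, so this sketch covers both under that assumption: adding two __m128 values lane by lane, and summing the four floats inside one __m128. The helper names `add4` and `hsum4` are hypothetical. For four 32-bit integers in an __m128i, SSE2 offers `_mm_add_epi32` for the lane-wise case.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Lane-wise addition: r[k] = a[k] + b[k] for k = 0..3.
inline __m128 add4(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);
}

// Horizontal sum v[0] + v[1] + v[2] + v[3] using SSE shuffles.
inline float hsum4(__m128 v) {
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));  // (v1, v0, v3, v2)
    __m128 sums = _mm_add_ps(v, shuf);   // lanes 0,1 hold v0+v1; lanes 2,3 hold v2+v3
    shuf = _mm_movehl_ps(shuf, sums);    // lane 0 now holds v2+v3
    sums = _mm_add_ss(sums, shuf);       // lane 0 holds the total
    return _mm_cvtss_f32(sums);
}
```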
|
|
|
|
 |
 | A question lei_ma2003 | 23:59 18 Apr '07 |
|
 |
Could SSE be used in managed C++ application?
|
|
|
|
 |
 | Question shaihnc | 11:41 22 Jul '05 |
|
 |
I have decided to use the MSDN instructions for SSE2 programming.
I load 8 16-bit short numbers into an __m128i variable as follows:
_declspec(align(16)) short t1[100000];
_declspec(align(16)) short t2[100000];
__m128i temp1, temp2;
__m128i mul1, mul2;
temp1 = _mm_load_si128((__m128i*) ((short *) &t1[i]));
temp2 = _mm_load_si128((__m128i*) ((short *) &t2[i]));
Then I use the _mm_mullo_epi16 function to multiply my variables:
mul1 = _mm_mullo_epi16(temp1, temp2);
So now I have the lower 16 bits of each 32-bit result in mul1. Now I want to be able to add these eight 16-bit short values together, or to separate them.
I cannot find any instruction which lets me do that.
Can someone please help me?
|
|
|
|
 |
|
 |
I know this is probably way late and you already figured it out, but did you try a union?
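A sketch of two ways this could look, assuming SSE2. The first spills the register to a small array, which does the same job as reading it through a union. The second uses `_mm_madd_epi16`, which keeps full 32-bit products instead of only the low 16 bits. The helper names `sum_lanes_epi16` and `dot8_epi16` are hypothetical.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Adds the eight 16-bit lanes of v by storing them to memory first.
inline std::int32_t sum_lanes_epi16(__m128i v) {
    std::int16_t lanes[8];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), v);
    std::int32_t total = 0;
    for (int k = 0; k < 8; ++k) total += lanes[k];
    return total;
}

// Sum of a[k] * b[k] over eight shorts, using full 32-bit products.
inline std::int32_t dot8_epi16(__m128i a, __m128i b) {
    // Four 32-bit lanes: a0*b0 + a1*b1, a2*b2 + a3*b3, and so on.
    __m128i pairs = _mm_madd_epi16(a, b);
    std::int32_t lanes[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), pairs);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

With the variables from the post, `sum_lanes_epi16(mul1)` would add the eight truncated products, while `dot8_epi16(temp1, temp2)` avoids the truncation.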
|
|
|
|
 |
 | A question Sachini M | 17:39 7 Jun '05 |
|
 |
Excellent article!
I'm very new to this topic and have a question. When using SSE, does the number of iterations of each loop always have to be a multiple of 4? Let's say you need to do a check (an if statement inside the loop) at every iteration: is there a way to use SSE, or is there any use in using it?
Thanks in advance!
Regards, Sachini
|
|
|
|
 |
|
 |
There is no restriction on the number of iterations, but every iteration works with 4 float numbers. This means the array size should be a multiple of 4, and the number of iterations is array size / 4.
|
|
|
|
 |
|
 |
Well, actually, arrays do not need to be a multiple of 4. What you can do is for the portion that is a multiple of 4, do the SSE instructions, and with what's left over, do the regular way without SSE (which will be at max 3 iterations). This lets your optimization be dynamic across multiple array sizes.
So say you want to mess with an array of size 37. The first 36 you do with the SSE implementation, the last 1 you do with the normal implementation (without SSE).
It was a great question that wasn't addressed in the article. It's best practice to assume when creating such a function using SSE that it allows for arrays of any size.
--- punkbuster
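The approach described above can be sketched as follows. The operation (multiplying by a factor) and the name `scale` are assumptions chosen only to keep the example short.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Multiplies every element by factor, for an array of any size.
void scale(float* data, std::size_t n, float factor) {
    const __m128 f = _mm_set1_ps(factor);
    const std::size_t simd_end = n - n % 4;  // largest multiple of 4 not above n

    // SSE part: four floats per iteration.
    for (std::size_t i = 0; i < simd_end; i += 4)
        _mm_storeu_ps(data + i, _mm_mul_ps(_mm_loadu_ps(data + i), f));

    // Regular part: at most 3 leftover elements.
    for (std::size_t i = simd_end; i < n; ++i)
        data[i] *= factor;
}
```

For an array of size 37, the first loop covers elements 0 to 35 and the second loop handles element 36.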
|
|
|
|
 |
 | array memory alignment ? not_happy0 | 21:08 11 May '05 |
|
 |
Hi, I am new to SSE and am wondering: if the __m128 data type is "auto-aligned", why is the result of new __m128[xx] not aligned? It seems I have to use _aligned_malloc instead? Thanks in advance.
|
|
|
|
 |
|
 |
I have never tried to make an __m128[] array, so I don't know exactly whether it is aligned or not. What is the purpose of such an array? We need __m128 variables to work with SSE registers; input and output vectors should be kept in float arrays.
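One likely explanation is that older compilers' operator new only guaranteed the alignment needed by fundamental types, regardless of the element type. A minimal sketch of an explicit aligned allocation, assuming the compiler provides `_mm_malloc` and `_mm_free` through its SSE header; the wrapper names `alloc_m128`, `free_m128` and `is_aligned16` are hypothetical.

```cpp
#include <xmmintrin.h>  // __m128, _mm_malloc, _mm_free
#include <cstddef>
#include <cstdint>

// True when the address is a multiple of 16.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Allocates count __m128 values on a 16-byte boundary.
inline __m128* alloc_m128(std::size_t count) {
    return static_cast<__m128*>(_mm_malloc(count * sizeof(__m128), 16));
}

// Memory from alloc_m128 must be released with _mm_free, not delete[].
inline void free_m128(__m128* p) {
    _mm_free(p);
}
```

Checking a pointer with `is_aligned16` before using aligned loads is a cheap way to catch this kind of problem in a debug build.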
|
|
|
|
 |
 | performance loss using SSE David St. Hilaire | 9:18 3 Dec '04 |
|
 |
Thanks for the article.
I executed your sample apps, and there is a significant performance boost when using SSE instead of just C++. However, the functions I've written in with SSE intrinsics have been taking 2-3 times as long to execute as their C++ counterparts. Do you know what might cause this?
Below is a function I wrote to get the minimum and maximum values of an array. This executes in roughly 80-90 microseconds on an array of 640 numbers. The C++ function that does the same thing takes 28-31 microseconds. What gives? The SSE version has to do the memcpy to get the input array aligned correctly, but this only accounts for about 26 microseconds of the difference. I realize that I'm using shorts instead of floats, but it should still work. I converted your SSESample program to use shorts and only calculate the min and max of the input array. The SSE code executed less than twice as fast as the C++ code after that, but it was still faster.
Here's the code: <code>
void FindArrayMinMax(short *pnArray, long nCount, short &nMin, short &nMax)
{
    int i; // declared once so that all three loops below can share it

    short *pnIn = (short*) _aligned_malloc(nCount*sizeof(short), 16); // 16-byte aligned for SSE
    memcpy(pnIn, pnArray, nCount*sizeof(short));
    long nOutputSize = 4 + nCount%4;
    short *pnMaxOut = (short*) _aligned_malloc(nOutputSize*sizeof(short), 16);
    short *pnMinOut = (short*) _aligned_malloc(nOutputSize*sizeof(short), 16);

    __m64 *pmIter = (__m64*) pnIn;
    __m64 *pmMax = (__m64*) pnMaxOut;
    __m64 *pmMin = (__m64*) pnMinOut;

    *pmMax = *pmMin = *pmIter; // save first 4 values as minima and maxima
    long nLoop = nCount/4;
    for (i=1; i<nLoop; i++)
    {
        // get next 4 values and compare them to the saved minima and maxima
        pmIter++;
        *pmMax = _mm_max_pi16(*pmIter, *pmMax);
        *pmMin = _mm_min_pi16(*pmIter, *pmMin);
    }

    // get strays, in case nCount is not a multiple of 4
    short nVal(0);
    nLoop = nCount % 4;
    for (i=1; i<=nLoop; i++)
    {
        nVal = pnIn[nCount-i];
        pnMaxOut[i+3] = nVal;
        pnMinOut[i+3] = nVal;
    }

    // get max and min values
    nMax = pnMaxOut[0];
    nMin = pnMinOut[0];
    for (i=1; i<nOutputSize; i++)
    {
        if (nMax < pnMaxOut[i]) nMax = pnMaxOut[i];
        if (nMin > pnMinOut[i]) nMin = pnMinOut[i];
    }

    // cleanup
    _aligned_free(pnIn);
    _aligned_free(pnMinOut);
    _aligned_free(pnMaxOut);
    _mm_empty();
}
</code>
|
|
|
|
 |
 | Re: performance loss using SSE Alex Farber | 9:33 3 Dec '04 |
|
 |
640 is not a significant number of elements for SSE. You need to do this for very long arrays, which are used in image processing, graphics, 3D etc. My second sample shows how to find the minimum and maximum; I don't see anything similar in your code. Does it give the right result? Instead of copying the whole array to an aligned array, you need to start from the first aligned member of the input array. Anyway, you need to use MMX for these short numbers; take a look at my MMX article. On Pentium 4 you can use SSE2. Sorry that I didn't try to understand your code; SSE programming takes a lot of time. I can try to do this, but the code must be clear, without float-short tricks.
|
|
|
|
 |
|
 |
Thanks for your response. I realize that 640 is not a lot of elements, but this function is called many, many, times and it is slowing down my app. I do use code similar to yours to find the min and max, except that I'm using _mm_min/max_pi16 instead of _mm_min/max_ps. It does return the correct result; I've checked it against the C++ version of the function. There aren't min and max functions in MMX, but I was able to get it working by using the greater than function. Unfortunately, it takes more instructions and is a little slower than SSE. I don't know what you mean by "float-short" tricks in my code. There were no floats at all in the code that I posted. You don't have to read my code if you don't want to. The example I posted isn't the only time I've had SSE code run slower than C++. I just thought you or someone else might have some ideas why SSE in general would run slower than normal C++ code.
How do you determine which element of an array is the first aligned input array member?
Thanks again, Dave
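On the question of finding the first aligned member: one common approach is to look at the address modulo 16. A minimal sketch, assuming the array of shorts is at least 2-byte aligned (which is normal for short); the name `elements_until_aligned16` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Returns how many leading elements must be handled by ordinary code
// before p + result sits on a 16-byte boundary (capped at count).
inline std::size_t elements_until_aligned16(const short* p, std::size_t count) {
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    const std::size_t bytes = (16 - addr % 16) % 16;   // bytes up to the next boundary
    const std::size_t k = bytes / sizeof(short);       // the same distance in elements
    return k < count ? k : count;
}
```

The elements before that index go through the plain C++ path, the aligned middle part through SIMD code, and any leftover elements at the end through the plain path again.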
|
|
|
|
 |
|
 |
Well, this is my code:
void FindMinMaxC(short* pnArray, int size, short& min, short& max)
{
    max = SHRT_MIN;
    min = SHRT_MAX;

    for ( int i = 0; i < size; i++ )
    {
        if ( *pnArray < min )
            min = *pnArray;

        if ( *pnArray > max )
            max = *pnArray;

        pnArray++;
    }
}

void FindMinMaxSSE(short* pnArray, int size, short& min, short& max)
{
    int i;

    // gives access to the four shorts packed in one __m64 value
    union u
    {
        __m64 m;
        short n[4];
    } x;

    for ( i = 0; i < 4; i++ )
        x.n[i] = SHRT_MIN;

    __m64 max64 = x.m;

    for ( i = 0; i < 4; i++ )
        x.n[i] = SHRT_MAX;

    __m64 min64 = x.m;

    __m64* pSource = (__m64*) pnArray;

    // every iteration handles 4 shorts; size is expected to be a multiple of 4
    for ( i = 0; i < size/4; i++ )
    {
        min64 = _mm_min_pi16(*pSource, min64);
        max64 = _mm_max_pi16(*pSource, max64);

        pSource++;
    }

    // min()/max() below must be function-like macros (e.g. from <windows.h>),
    // because the parameters are also named min and max
    x.m = min64;
    min = min(x.n[0], min(x.n[1], min(x.n[2], x.n[3])));

    x.m = max64;
    max = max(x.n[0], max(x.n[1], max(x.n[2], x.n[3])));

    _mm_empty(); // clear the MMX state before any later floating-point code
}
I don't care about alignment and array size in the FindMinMaxSSE function, assuming that the client takes care of this. Test results for 1,000,000 members: C++ 20 ms, SSE 7 ms.
Testing for 10000 members I get 0 in both cases.
Tests must be done in Release configuration. Again, there is no need to use SSE for small arrays. It doesn't matter that you call function many times. Array must be very long to get performance boost from SSE. In your case, use C++ code.
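For comparison, a rough SSE2 counterpart that handles 8 shorts per iteration and accepts any array size. This is a sketch under the assumption that SSE2 is available, not a measured improvement; the name `find_min_max_sse2` is hypothetical.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <climits>
#include <cstddef>

// Finds the minimum and maximum of n shorts. For n == 0 the outputs
// keep their initial values (SHRT_MAX and SHRT_MIN).
void find_min_max_sse2(const short* data, std::size_t n, short& lo, short& hi) {
    __m128i vmin = _mm_set1_epi16(SHRT_MAX);
    __m128i vmax = _mm_set1_epi16(SHRT_MIN);

    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        // Unaligned load, so the caller does not have to align the array.
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        vmin = _mm_min_epi16(vmin, v);
        vmax = _mm_max_epi16(vmax, v);
    }

    // Reduce the eight per-lane results to single values.
    short mins[8];
    short maxs[8];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(mins), vmin);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(maxs), vmax);
    lo = SHRT_MAX;
    hi = SHRT_MIN;
    for (int k = 0; k < 8; ++k) {
        if (mins[k] < lo) lo = mins[k];
        if (maxs[k] > hi) hi = maxs[k];
    }

    // Leftover elements when n is not a multiple of 8.
    for (; i < n; ++i) {
        if (data[i] < lo) lo = data[i];
        if (data[i] > hi) hi = data[i];
    }
}
```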
|
|
|
|
 |
 | AMD support Jens froslev-nielsen | 2:20 1 Dec '04 |
|
 |
Thanks for two well-written articles (sseintro & mmxintro). Now I wonder: do you, or perhaps anybody here, know how to use the 3DNow! technology in the same manner as shown here?
|
|
|
|
 |
 | q: movaps vs. movups yoaz | 9:31 4 Nov '04 |
|
 |
sorry to bother u again with beginner's questions, but i'm quite stuck. I have a class using SSE. I'm declaring a member private variable:
__declspec(align(16))unsigned char m_nodes[ARRAY_SIZE];
later on i try to use it in an asm block,
movaps xmm0, [esi]
with esi pointing to the array base address. This however throws an exception, which is because the array is not aligned (the base address should be a multiple of 16, am i right?). I can't figure it out. why isn't my array aligned? another, final, question: do you know, or can u point me to the actual performance difference between movaps and movups
thanks
there are no facts, only interpretations
|
|
|
|
 |
|
 |
1) What is ARRAY_SIZE value? Why variable type is unsigned char and not float? What exception exactly do you have? 2) Take a look at Assembly code generated by C++ compiler from movaps and movups.
|
|
|
|
 |
|
 |
Alex Farber wrote:
What is ARRAY_SIZE value?
it's an int, value=16
Alex Farber wrote:
Why variable type is unsigned char and not float?
I want to use SSE2 for SIMD operations on 16 bytes
Alex Farber wrote:
What exception exactly do you have?
SEHException. But it occurs with movaps and not with movups.
Alex Farber wrote:
Take a look at Assembly code generated by C++ compiler from movaps and movups.
i was hoping to generate the Assembly code myself (working with inline Assembly), but i'll debug again.
Thanks a lot for the suggestions. I've managed to work around this, by using _aligned_malloc, though I have no idea why this aligns member variables, and __declspec(align(16)) doesn't. Any ideas?
thanks again,
there are no facts, only interpretations
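One possible explanation, offered as an assumption: an alignment attribute on a member fixes its offset inside the object, so the member is only 16-byte aligned if the object itself starts on a suitable boundary, which a heap or managed allocation may not guarantee. The sketch below shows the portable C++11 spelling and a load that falls back to the unaligned form when needed. The names `Nodes` and `load16` are hypothetical.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// alignas on the member raises the alignment requirement of the whole struct,
// but whoever allocates the object still has to honor that requirement.
struct Nodes {
    alignas(16) unsigned char bytes[16];
};

// Loads 16 bytes, using the aligned form only when the address allows it.
inline __m128i load16(const unsigned char* p) {
    if (reinterpret_cast<std::uintptr_t>(p) % 16 == 0)
        return _mm_load_si128(reinterpret_cast<const __m128i*>(p));   // aligned load
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));      // unaligned load
}
```

The aligned form faults on a misaligned address, which matches the exception seen with movaps but not with movups.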
|
|
|
|
 |
 | Excelent! + a question yoaz | 0:52 20 Sep '04 |
|
 |
A really interesting and enlightening article. I have a small question: as I understand it, MMX uses the mm0-mm7 registers, which are actually the CPU floating-point registers, whereas SSE/2 uses the xmm0-xmm7 registers, which were defined especially for SIMD purposes. And finally, the question(s) -
- Does this mean that I can use both types of registers simultaneously?
- Does this mean that I can do without the EMMS instruction when writing pure SSE/2 code?
Thanks, I really enjoyed this article.
there are no facts, only interpretations
|
|
|
|
 |
|
 |
AFAIK, EMMS instruction must be used only with MMX:
The EMMS instruction must be used to clear the MMX™ technology state at the end of all MMX™ technology routines and before calling other procedures or subroutines that may execute floating-point instructions. If a floating-point instruction loads one of the registers in the FPU register stack before the FPU tag word has been reset by the EMMS instruction, a floating-point stack overflow can occur that will result in a floating-point exception or incorrect result.
SSE doesn't require this instruction. I don't have experience in using SSE2.
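A small sketch of where EMMS goes in practice, assuming a compiler and target that still provide the MMX intrinsics. `_mm_empty()` is the intrinsic that emits EMMS; the function `average4_mmx` is a hypothetical example that mixes an MMX addition with floating-point code.

```cpp
#include <mmintrin.h>  // MMX intrinsics
#include <cstring>

// Adds two groups of four shorts with MMX, then averages the result as float.
float average4_mmx(const short* a, const short* b) {
    __m64 va, vb;
    std::memcpy(&va, a, sizeof va);
    std::memcpy(&vb, b, sizeof vb);

    __m64 sum = _mm_add_pi16(va, vb);  // four 16-bit additions at once

    short lanes[4];
    std::memcpy(lanes, &sum, sizeof lanes);

    _mm_empty();  // EMMS: release the shared registers before floating-point code

    return (lanes[0] + lanes[1] + lanes[2] + lanes[3]) / 4.0f;
}
```

Code that only uses the xmm registers through SSE intrinsics has no such call, in line with the reply above.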
|
|
|
|
 |