Hi. After reading your article, I know how to write simple SSE code. But what if my source data is unsigned short? How can I make it more efficient? Even after aligning the unsigned short data, it still needs the same number of loop iterations. For example, dividing Source1 by Source2 takes 10 iterations with SSE:
unsigned short Source1[40]; unsigned short Source2[40];
But if I could pack the data into floats, maybe it would need only 5 iterations? How can I do that? Thanks.
To work with integers, you need MMX or SSE2, not SSE. Here is an MMX introduction: http://www.codeproject.com/KB/recipes/mmxintro.aspx
SSE2 has the same integer capabilities but larger registers. Anyway, these technologies are meant for huge arrays; there is no sense in using them for small amounts of data. Also, on modern computers SSE and MMX do not give as significant a performance boost as they did on the old Pentium III.
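For illustration, here is a minimal sketch of how the unsigned short division could look with SSE2 intrinsics. It is not taken from the article: `DivideU16` is a made-up name, and the sketch assumes the element count is a multiple of 8 and that no divisor is zero. Each iteration widens 8 shorts to floats, divides, truncates, and packs the results back, so 40 elements take 5 iterations.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Hypothetical helper: out[i] = a[i] / b[i] for unsigned 16-bit values.
// Assumes count is a multiple of 8 and every b[i] is non-zero.
void DivideU16(const unsigned short* a, const unsigned short* b,
               unsigned short* out, std::size_t count)
{
    const __m128i zero   = _mm_setzero_si128();
    const __m128i bias32 = _mm_set1_epi32(32768);
    const __m128i bias16 = _mm_set1_epi16(-32768);  // bit pattern 0x8000

    for (std::size_t i = 0; i < count; i += 8) {
        // Unaligned loads, so the caller does not have to align the arrays.
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));

        // Zero-extend the 8 shorts to two groups of 4 ints, then to floats.
        __m128 aLo = _mm_cvtepi32_ps(_mm_unpacklo_epi16(va, zero));
        __m128 aHi = _mm_cvtepi32_ps(_mm_unpackhi_epi16(va, zero));
        __m128 bLo = _mm_cvtepi32_ps(_mm_unpacklo_epi16(vb, zero));
        __m128 bHi = _mm_cvtepi32_ps(_mm_unpackhi_epi16(vb, zero));

        // Divide and truncate back to 32-bit integers.
        __m128i qLo = _mm_cvttps_epi32(_mm_div_ps(aLo, bLo));
        __m128i qHi = _mm_cvttps_epi32(_mm_div_ps(aHi, bHi));

        // SSE2 only has a signed saturating pack, so shift the range
        // 0..65535 down to -32768..32767, pack, then flip the top bit back.
        qLo = _mm_sub_epi32(qLo, bias32);
        qHi = _mm_sub_epi32(qHi, bias32);
        __m128i packed = _mm_xor_si128(_mm_packs_epi32(qLo, qHi), bias16);

        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), packed);
    }
}
```

Whether this beats a plain loop for only 40 elements is doubtful, for the reasons given in the reply above; the conversions cost time too.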
The SSE C++ and inline assembly timings include two optimizations: one due to the SSE calculations and one due to unrolling the loop by a factor of 4. It would be enlightening to compare a loop-unrolled native C++ timing with the other two methods.
Jeff
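A sketch of what such an unrolled baseline could look like. The article's exact computation is not repeated in this thread, so an element-wise multiply is used as a stand-in, and `MultiplyUnrolled4` is a hypothetical name.

```cpp
#include <cstddef>

// Hypothetical baseline: plain C++ element-wise multiply, unrolled by 4
// so that each iteration handles as many elements as one SSE step.
void MultiplyUnrolled4(const float* a, const float* b, float* out,
                       std::size_t count)
{
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        out[i]     = a[i]     * b[i];
        out[i + 1] = a[i + 1] * b[i + 1];
        out[i + 2] = a[i + 2] * b[i + 2];
        out[i + 3] = a[i + 3] * b[i + 3];
    }
    for (; i < count; ++i)  // leftover 0..3 elements
        out[i] = a[i] * b[i];
}
```

Timing this against the SSE version would separate the gain from unrolling from the gain from the SIMD instructions themselves.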
I have decided to follow the MSDN instructions for SSE2 programming.
I load 8 16-bit short numbers into __m128i variables as follows:
_declspec(align(16)) short t1[100000];
_declspec(align(16)) short t2[100000];
__m128i temp1, temp2;
__m128i mul1, mul2;
temp1 = _mm_load_si128((__m128i*) ((short *) &t1[i]));
temp2 = _mm_load_si128((__m128i*) ((short *) &t2[i]));
Then I use the _mm_mullo_epi16 function to multiply my variables:
mul1 = _mm_mullo_epi16(temp1, temp2);
So now I have the lower 16 bits of each 32-bit result in mul1. Now I want to be able to add these 8 16-bit short values together, or to separate them.
I cannot find any instruction which lets me do that.
Can someone please help me?
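One possible approach, sketched with SSE2 intrinsics. Note that this swaps `_mm_mullo_epi16` for `_mm_madd_epi16`, which multiplies the 8 pairs and adds neighbouring products into 4 full 32-bit sums, so the upper bits of each product are not lost. The function names `DotProduct8` and `Separate8` are hypothetical, and the sketch assumes the final 32-bit sum does not overflow.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: returns t1[0]*t2[0] + ... + t1[7]*t2[7] as a 32-bit int.
int DotProduct8(const short* t1, const short* t2)
{
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(t1));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(t2));

    // Multiplies the 8 pairs and adds neighbouring products: 4 x 32-bit sums.
    __m128i sums = _mm_madd_epi16(a, b);

    // Horizontal add of the 4 ints: swap the halves and add, then swap
    // neighbouring lanes and add, leaving the total in lane 0.
    sums = _mm_add_epi32(sums, _mm_shuffle_epi32(sums, _MM_SHUFFLE(1, 0, 3, 2)));
    sums = _mm_add_epi32(sums, _mm_shuffle_epi32(sums, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(sums);
}

// Hypothetical helper: copies the 8 lanes of a register into a plain array
// so they can be read one by one.
void Separate8(__m128i v, short out[8])
{
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), v);
}
```

`Separate8` works on any `__m128i`, including the `mul1` result of `_mm_mullo_epi16` from the post above.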
Excellent article!
I'm very new to this topic and have a question. When using SSE, does the number of iterations of each loop always have to be a multiple of 4? Let's say you need to do a check (an if statement inside the loop) at every iteration: is there a way to use SSE, or is there any point in using it?
Thanks in advance!
Regards, Sachini
There is no restriction on the number of iterations, but every iteration works with 4 float numbers. This means the array size should be a multiple of 4, and the number of iterations is array size/4.
Well, actually, the array size does not need to be a multiple of 4. What you can do is use the SSE instructions for the portion that is a multiple of 4, and handle what's left over the regular way without SSE (which will be at most 3 iterations). This lets your optimization work for any array size.
So say you want to mess with an array of size 37. The first 36 you do with the SSE implementation, the last 1 you do with the normal implementation (without SSE).
It was a great question that wasn't addressed in the article. It's best practice to assume when creating such a function using SSE that it allows for arrays of any size.
--- punkbuster
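A minimal sketch of that split, using an element-wise multiply as a stand-in for the real work. `MultiplyAnySize` is a hypothetical name, and unaligned loads are used so the sketch makes no assumption about alignment.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Hypothetical example: out[i] = a[i] * b[i] for any count.
void MultiplyAnySize(const float* a, const float* b, float* out,
                     std::size_t count)
{
    const std::size_t simdCount = count - (count % 4);  // largest multiple of 4

    // SSE part: 4 floats per iteration.
    for (std::size_t i = 0; i < simdCount; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_mul_ps(va, vb));
    }

    // Leftover part: at most 3 elements, done the regular way.
    for (std::size_t i = simdCount; i < count; ++i)
        out[i] = a[i] * b[i];
}
```

With an array of size 37, the first loop covers elements 0 to 35 and the second loop covers element 36.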
Hi, I am new to SSE and am wondering: if the __m128 data type is "auto-aligned", why is the result of new __m128[xx] not aligned? I seem to have to use _aligned_malloc instead. Thanks in advance.
I have never tried to make an __m128[] array, so I don't know exactly whether it is aligned or not. What is the purpose of such an array? We need __m128 variables to work with the SSE registers; input and output vectors should be kept in float arrays.
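A likely explanation, offered as an assumption rather than something confirmed in this thread: before C++17, `new[]` was only required to return storage aligned for fundamental types, so an over-aligned type such as `__m128` could land on an 8-byte boundary in 32-bit builds. The sketch below keeps the data in a float array, as suggested above, and allocates it with `_mm_malloc`, the portable cousin of `_aligned_malloc`. `IsAligned16`, `AllocAlignedFloats` and `ScaleAligned` are hypothetical names.

```cpp
#include <xmmintrin.h>  // SSE intrinsics, _mm_malloc / _mm_free
#include <cstddef>
#include <cstdint>

// True when p sits on a 16-byte boundary.
bool IsAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical helper: allocates 'count' floats on a 16-byte boundary.
// Release the result with _mm_free, not delete[] or free.
float* AllocAlignedFloats(std::size_t count)
{
    return static_cast<float*>(_mm_malloc(count * sizeof(float), 16));
}

// Hypothetical usage: scale an aligned array in place.
// Assumes count is a multiple of 4.
void ScaleAligned(float* data, std::size_t count, float factor)
{
    const __m128 f = _mm_set1_ps(factor);
    for (std::size_t i = 0; i < count; i += 4)
        _mm_store_ps(data + i, _mm_mul_ps(_mm_load_ps(data + i), f));  // aligned load/store
}
```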
Thanks for the article.
I executed your sample apps, and there is a significant performance boost when using SSE instead of just C++. However, the functions I've written with SSE intrinsics have been taking 2-3 times as long to execute as their C++ counterparts. Do you know what might cause this?
Below is a function I wrote to get the minimum and maximum values of an array. This executes in roughly 80-90 microseconds on an array of 640 numbers. The C++ function that does the same thing takes 28-31 microseconds. What gives? The SSE version has to do the memcpy to get the input array aligned correctly, but this only accounts for about 26 microseconds of the difference. I realize that I'm using shorts instead of floats, but it should still work. I converted your SSESample program to use shorts and only calculate the min and max of the input array. The SSE code executed less than twice as fast as the C++ code after that, but it was still faster.
Here's the code:
<code>
void FindArrayMinMax(short *pnArray, long nCount, short &nMin, short &nMax)
{
    short *pnIn = (short*) _aligned_malloc(nCount*sizeof(short), 16); // 16-byte aligned for SSE
    memcpy(pnIn, pnArray, nCount*sizeof(short));
    long nOutputSize = 4 + nCount%4;
    short *pnMaxOut = (short*) _aligned_malloc(nOutputSize*sizeof(short), 16);
    short *pnMinOut = (short*) _aligned_malloc(nOutputSize*sizeof(short), 16);

    __m64 *pmIter = (__m64*) pnIn;
    __m64 *pmMax = (__m64*) pnMaxOut;
    __m64 *pmMin = (__m64*) pnMinOut;

    *pmMax = *pmMin = *pmIter; // save first 4 values as minima and maxima
    long nLoop = nCount/4;
    for (int i=1; i<nLoop; i++)
    {
        // get next 4 values and compare them to the saved minima and maxima
        pmIter++;
        *pmMax = _mm_max_pi16(*pmIter, *pmMax);
        *pmMin = _mm_min_pi16(*pmIter, *pmMin);
    }

    // get strays, in case nCount is not a multiple of 4
    short nVal(0);
    nLoop = nCount % 4;
    for (i=1; i<=nLoop; i++)
    {
        nVal = pnIn[nCount-i];
        pnMaxOut[i+3] = nVal;
        pnMinOut[i+3] = nVal;
    }

    // get max and min indices
    nMax = pnMaxOut[0];
    nMin = pnMinOut[0];
    for (i=1; i<nOutputSize; i++)
    {
        if (nMax < pnMaxOut[i]) nMax = pnMaxOut[i];
        if (nMin > pnMinOut[i]) nMin = pnMinOut[i];
    }

    // cleanup
    _aligned_free(pnIn);
    _aligned_free(pnMinOut);
    _aligned_free(pnMaxOut);
    _mm_empty();
}
</code>
640 elements is not enough to benefit from SSE. You need to do this for very long arrays, which are used in image processing, graphics, 3D, etc. My second sample shows how to find the minimum and maximum; I don't see anything similar in your code. Does it give the right result? Instead of copying the whole array to an aligned array, you should start from the first aligned member of the input array. Anyway, you need to use MMX for these short numbers; take a look at my MMX article. On a Pentium 4 you can use SSE2. Sorry that I haven't tried to understand your code; SSE programming takes a lot of time. I can try to do this, but the code must be clear, without float-short tricks.
Thanks for your response. I realize that 640 is not a lot of elements, but this function is called many, many, times and it is slowing down my app. I do use code similar to yours to find the min and max, except that I'm using _mm_min/max_pi16 instead of _mm_min/max_ps. It does return the correct result; I've checked it against the C++ version of the function. There aren't min and max functions in MMX, but I was able to get it working by using the greater than function. Unfortunately, it takes more instructions and is a little slower than SSE. I don't know what you mean by "float-short" tricks in my code. There were no floats at all in the code that I posted. You don't have to read my code if you don't want to. The example I posted isn't the only time I've had SSE code run slower than C++. I just thought you or someone else might have some ideas why SSE in general would run slower than normal C++ code.
How do you determine which element of an array is the first aligned input array member?
Thanks again, Dave
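On the "first aligned member" question: an address is 16-byte aligned when its low four bits are zero, so one option is to step through the array the regular way until that is true and start the SIMD loop there. The sketch below does this with the SSE2 intrinsics `_mm_min_epi16` and `_mm_max_epi16` instead of the MMX-width ones used in the thread. `FindMinMaxSSE2` is a hypothetical name, and it avoids the memcpy entirely.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <climits>
#include <cstddef>
#include <cstdint>

// Hypothetical SSE2 variant: min/max of a short array of any size and alignment.
void FindMinMaxSSE2(const short* p, std::size_t count, short& minOut, short& maxOut)
{
    short lo = SHRT_MAX, hi = SHRT_MIN;
    std::size_t i = 0;

    // Scalar prologue: step until the address is a multiple of 16.
    while (i < count && (reinterpret_cast<std::uintptr_t>(p + i) & 15u) != 0) {
        if (p[i] < lo) lo = p[i];
        if (p[i] > hi) hi = p[i];
        ++i;
    }

    // Aligned SSE2 body: 8 shorts per iteration.
    __m128i vmin = _mm_set1_epi16(SHRT_MAX);
    __m128i vmax = _mm_set1_epi16(SHRT_MIN);
    for (; i + 8 <= count; i += 8) {
        __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(p + i));
        vmin = _mm_min_epi16(vmin, v);
        vmax = _mm_max_epi16(vmax, v);
    }

    // Fold the 8 lanes into the scalar results.
    short lanesMin[8], lanesMax[8];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanesMin), vmin);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanesMax), vmax);
    for (int k = 0; k < 8; ++k) {
        if (lanesMin[k] < lo) lo = lanesMin[k];
        if (lanesMax[k] > hi) hi = lanesMax[k];
    }

    // Scalar epilogue: fewer than 8 elements left.
    for (; i < count; ++i) {
        if (p[i] < lo) lo = p[i];
        if (p[i] > hi) hi = p[i];
    }
    minOut = lo;
    maxOut = hi;
}
```

Whether this is faster than the plain C++ loop for 640 elements would have to be measured; the sketch only shows the alignment handling.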
Well, this is my code:
void FindMinMaxC(short* pnArray, int size, short& min, short& max)
{
    max = SHRT_MIN;
    min = SHRT_MAX;

    for ( int i = 0; i < size; i++ )
    {
        if ( *pnArray < min )
            min = *pnArray;

        if ( *pnArray > max )
            max = *pnArray;

        pnArray++;
    }
}

void FindMinMaxSSE(short* pnArray, int size, short& min, short& max)
{
    int i;

    union u
    {
        __m64 m;
        short n[4];
    } x;

    for ( i = 0; i < 4; i++ )
        x.n[i] = SHRT_MIN;

    __m64 max64 = x.m;

    for ( i = 0; i < 4; i++ )
        x.n[i] = SHRT_MAX;

    __m64 min64 = x.m;

    __m64* pSource = (__m64*) pnArray;

    // 4 shorts per iteration
    for ( i = 0; i < size/4; i++ )
    {
        min64 = _mm_min_pi16(*pSource, min64);
        max64 = _mm_max_pi16(*pSource, max64);

        pSource++;
    }

    // reduce the 4 lanes to a single value
    x.m = min64;
    min = min(x.n[0], min(x.n[1], min(x.n[2], x.n[3])));

    x.m = max64;
    max = max(x.n[0], max(x.n[1], max(x.n[2], x.n[3])));

    _mm_empty();    // clear the MMX state before any floating-point code runs
}
FindMinMaxSSE does not handle alignment or array sizes that are not a multiple of 4; I assume the caller takes care of that. Test results for 1000000 members:
C++: 20 ms
SSE: 7 ms
Testing with 10000 members, I get 0 in both cases.
Tests must be done in Release configuration. Again, there is no need to use SSE for small arrays. It doesn't matter that you call the function many times. The array must be very long to get a performance boost from SSE. In your case, use C++ code.
Thanks for 2 well-written articles (sseintro & mmxintro). Now I wonder: do you, or perhaps anybody else here, know how to use the 3DNow! technology in the same manner as shown here?
Sorry to bother you again with beginner's questions, but I'm quite stuck. I have a class using SSE. I'm declaring a private member variable:
__declspec(align(16)) unsigned char m_nodes[ARRAY_SIZE];
Later on I try to use it in an asm block:
movaps xmm0, [esi]
with esi pointing to the array base address. This, however, throws an exception, which is because the array is not aligned (the base address should be a multiple of 16, am I right?). I can't figure it out: why isn't my array aligned?
Another, final question: do you know, or can you point me to, the actual performance difference between movaps and movups?
Thanks
there are no facts, only interpretations
1) What is the value of ARRAY_SIZE? Why is the variable type unsigned char and not float? Exactly what exception do you get?
2) Take a look at the assembly code the C++ compiler generates for movaps and movups.
Alex Farber wrote: What is ARRAY_SIZE value?
It's an int, value = 16.
Alex Farber wrote: Why variable type is unsigned char and not float?
I want to use SSE2 for SIMD operations on 16 bytes.
Alex Farber wrote: What exception exactly do you have?
SEHException. But it occurs with movaps and not with movups.
Alex Farber wrote: Take a look at Assembly code generated by C++ compiler from movaps and movups.
I was hoping to generate the assembly code myself (working with inline assembly), but I'll debug again.
Thanks a lot for the suggestions. I've managed to work around this by using _aligned_malloc, though I have no idea why this aligns member variables and __declspec(align(16)) doesn't. Any ideas?
thanks again,
there are no facts, only interpretations
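One possible explanation, stated as a guess because the thread does not show how the class is instantiated: `__declspec(align(16))` fixes the member's offset inside the object, but the object itself still has to start on a 16-byte boundary, and an ordinary heap allocation of that era does not promise this. The sketch below uses the standard `alignas(16)` spelling and places the object in `_mm_malloc` storage. `Nodes`, `CreateAlignedNodes` and `DestroyAlignedNodes` are hypothetical names.

```cpp
#include <xmmintrin.h>  // _mm_malloc / _mm_free
#include <cstdint>
#include <new>          // placement new

// Hypothetical stand-in for the class in question.
struct Nodes {
    alignas(16) unsigned char m_nodes[16];
};

// Allocates the object itself on a 16-byte boundary, which in turn
// puts m_nodes on a 16-byte boundary.
Nodes* CreateAlignedNodes()
{
    void* raw = _mm_malloc(sizeof(Nodes), 16);
    return raw ? new (raw) Nodes() : nullptr;
}

// Objects created above must be destroyed this way, not with delete.
void DestroyAlignedNodes(Nodes* n)
{
    if (n) {
        n->~Nodes();
        _mm_free(n);
    }
}
```

This would also be consistent with the `_aligned_malloc` workaround: it aligns the start of the block that holds the data.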
A really interesting and enlightening article. I have a small question: as I understand it, MMX uses the mm0-mm7 registers, which are actually the CPU floating-point registers, whereas SSE/2 uses the xmm0-xmm7 registers, which were defined especially for SIMD purposes. And finally, the question(s):
- Does this mean that I can use both types of registers simultaneously?
- Does this mean that I can do without the EMMS instruction when writing pure SSE/2 code?
Thanks, I really enjoyed this article.
there are no facts, only interpretations
AFAIK, the EMMS instruction is needed only with MMX:
The EMMS instruction must be used to clear the MMX™ technology state at the end of all MMX™ technology routines and before calling other procedures or subroutines that may execute floating-point instructions. If a floating-point instruction loads one of the registers in the FPU register stack before the FPU tag word has been reset by the EMMS instruction, a floating-point stack overflow can occur that will result in a floating-point exception or incorrect result.
SSE doesn't require this instruction. I don't have experience in using SSE2.
Hi
I'm trying to time some code (I'm a beginner with SSE), and when I compile the code below in Release (optimized for speed) with VC++ 2003, the optimizer does some weird things (put a breakpoint at the start of main and you will see).
#include "stdafx.h"
#include
#define SIZE 1
unsigned FindBase();
int _tmain(int argc, _TCHAR* argv[])
{
    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
    __m128 Data1, Data2, Res, Data3;
    Res.m128_f32[0] = 0.0f;
    Res.m128_f32[1] = 0.0f;
    Res.m128_f32[2] = 0.0f;
    Res.m128_f32[3] = 0.0f;
    float Values1[] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float Values2[] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float Results[] = { 0.0f, 0.0f, 0.0f, 0.0f };
    int i;

    unsigned base=0, iterations=0, sum=0;
    unsigned cycles_high1=0, cycles_low1=0;
    unsigned cycles_high2=0, cycles_low2=0;
    unsigned __int64 temp_cycles1=0, temp_cycles2=0;
    __int64 total_cycles=0;
    double seconds=0.0L;
    unsigned mhz=2000;
    base=FindBase();
    for (i=0; i<SIZE; i++)
    {
        // read the time stamp counter before the measured loop
        __asm {
            pushad
            CPUID
            RDTSC
            mov cycles_high1, edx
            mov cycles_low1, eax
            popad
        }

        for(int k = 0; k < 1000000000; k++)
        {
            Data1 = _mm_loadu_ps(Values1);
            Data2 = _mm_loadu_ps(Values2);
            Data3 = _mm_mul_ps(Data1, Data2);
            Res = _mm_add_ps(Data3, Res);
            _mm_storeu_ps(Results, Res);
        }

        // read the time stamp counter after the measured loop
        __asm {
            pushad
            CPUID
            RDTSC
            mov cycles_high2, edx
            mov cycles_low2, eax
            popad
        }
        temp_cycles1 = ((unsigned __int64)cycles_high1 << 32) | cycles_low1;
        temp_cycles2 = ((unsigned __int64)cycles_high2 << 32) | cycles_low2;
        total_cycles += temp_cycles2 - temp_cycles1 - base;
        iterations++;
    }

    SetPriorityClass(GetCurrentProcess(), NORMAL_PRIORITY_CLASS);

    seconds = double(total_cycles)/double(mhz*1000000);
    printf("Average cycles per loop: %f\n", double(total_cycles/iterations));
    printf("Average seconds per loop: %f\n", seconds/iterations);
}

// measures the overhead of the CPUID/RDTSC sequence itself
unsigned FindBase()
{
    unsigned base,base_extra=0;
    unsigned cycles_low, cycles_high;
    __asm {
        pushad
        CPUID
        RDTSC
        mov cycles_high, edx
        mov cycles_low, eax
        popad
        pushad
        CPUID
        RDTSC
        popad
        pushad
        CPUID
        RDTSC
        mov cycles_high, edx
        mov cycles_low, eax
        popad
        pushad
        CPUID
        RDTSC
        popad
        pushad
        CPUID
        RDTSC
        mov cycles_high, edx
        mov cycles_low, eax
        popad
        pushad
        CPUID
        RDTSC
        sub eax, cycles_low
        mov base_extra, eax
        popad
        pushad
        CPUID
        RDTSC
        mov cycles_high, edx
        mov cycles_low, eax
        popad
        pushad
        CPUID
        RDTSC
        sub eax, cycles_low
        mov base, eax
        popad
    }
    if (base_extra < base)
        base = base_extra;
    return base;
}
Did you try this code with the Intel C++ 8 compiler as well? It yields much better performance than the 10% you will get with VC 7.1, more like 350% faster!
Regards Lars Schouw