Comments and Discussions
Well, this is my code:
void FindMinMaxC(short* pnArray, int size, short& min, short& max)
{
max = SHRT_MIN;
min = SHRT_MAX;
for ( int i = 0; i < size; i++ )
{
if ( *pnArray < min )
min = *pnArray;
if ( *pnArray > max )
max = *pnArray;
pnArray++;
}
}
void FindMinMaxSSE(short* pnArray, int size, short& min, short& max)
{
int i;
union u
{
__m64 m;
short n[4];
} x;
for ( i = 0; i < 4; i++ )
x.n[i] = SHRT_MIN;
__m64 max64 = x.m;
for ( i = 0; i < 4; i++ )
x.n[i] = SHRT_MAX;
__m64 min64 = x.m;
__m64* pSource = (__m64*) pnArray;
for ( i = 0; i < size/4; i++ )
{
min64 = _mm_min_pi16(*pSource, min64);
max64 = _mm_max_pi16(*pSource, max64);
pSource++;
}
x.m = min64;
min = min(x.n[0],
min(x.n[1],
min(x.n[2],
x.n[3])));
x.m = max64;
max = max(x.n[0],
max(x.n[1],
max(x.n[2],
x.n[3])));
_mm_empty(); // __m64 values live in the MMX registers, so clear the MMX state before returning
}
FindMinMaxSSE does not handle alignment or array sizes that are not a multiple of 4; it assumes the caller takes care of both.
Test results for 1,000,000 elements:
C++ 20 ms
SSE 7 ms
With 10,000 elements I measure 0 ms in both cases.
Tests must be run in the Release configuration. Again, there is no need to use SSE for small arrays, and calling the function many times does not change that. The array must be very long to get a performance boost from SSE. In your case, use the C++ code.
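For readers who do want the function itself to cope with alignment and odd sizes, here is a minimal sketch. It is not the poster's code: it assumes SSE2 is available (so it uses the 128-bit `_mm_min_epi16` / `_mm_max_epi16` instead of the 64-bit MMX-register versions above), and the name `FindMinMaxSSE2` and the parameters `lo` / `hi` are made up for illustration.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper (not from the post): min and max of a short array,
// for any alignment and any length.
void FindMinMaxSSE2(const short* data, std::size_t size, short& lo, short& hi)
{
    assert(data != nullptr || size == 0);

    __m128i vmin = _mm_set1_epi16(SHRT_MAX);
    __m128i vmax = _mm_set1_epi16(SHRT_MIN);

    // Main loop: eight shorts per iteration, unaligned loads.
    std::size_t i = 0;
    for (; i + 8 <= size; i += 8) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        vmin = _mm_min_epi16(vmin, v);
        vmax = _mm_max_epi16(vmax, v);
    }

    // Reduce the eight lanes of each accumulator to one scalar.
    alignas(16) short mins[8];
    alignas(16) short maxs[8];
    _mm_store_si128(reinterpret_cast<__m128i*>(mins), vmin);
    _mm_store_si128(reinterpret_cast<__m128i*>(maxs), vmax);
    lo = SHRT_MAX;
    hi = SHRT_MIN;
    for (int k = 0; k < 8; ++k) {
        if (mins[k] < lo) lo = mins[k];
        if (maxs[k] > hi) hi = maxs[k];
    }

    // Scalar tail: the last size % 8 elements.
    for (; i < size; ++i) {
        if (data[i] < lo) lo = data[i];
        if (data[i] > hi) hi = data[i];
    }
}
```

The unaligned load trades a little speed for safety; whether that matters depends on the CPU, so it is worth measuring rather than assuming.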
|
|
|
|

|
Thanks for two well-written articles (sseintro & mmxintro).
Now I wonder: do you, or perhaps anybody else here, know how to use the 3DNow! technology in the same manner as shown here?
|
|
|
|

|
Sorry to bother you again with beginner's questions, but I'm quite stuck.
I have a class that uses SSE. I declare a private member variable:
__declspec(align(16))unsigned char m_nodes[ARRAY_SIZE];
Later on I try to use it in an asm block:
movaps xmm0, [esi]
with esi pointing to the array base address. This throws an exception, which I assume is because the array is not aligned (the base address should be a multiple of 16, am I right?).
I can't figure it out. Why isn't my array aligned?
One final question: do you know, or can you point me to, the actual performance difference between movaps and movups?
Thanks
there are no facts, only interpretations
|
|
|
|

|
1) What is the value of ARRAY_SIZE? Why is the variable type unsigned char and not float? What exception exactly do you get?
2) Take a look at the assembly code the C++ compiler generates for movaps and movups.
|
|
|
|

|
Alex Farber wrote:
What is ARRAY_SIZE value?
it's an int, value=16
Alex Farber wrote:
Why variable type is unsigned char and not float?
I want to use SSE2 for SIMD operations on 16 bytes
Alex Farber wrote:
What exception exactly do you have?
SEHException. But it occurs with movaps and not with movups.
Alex Farber wrote:
Take a look at Assembly code generated by C++ compiler from movaps and movups.
I was hoping to write the assembly code myself (I'm working with inline assembly), but I'll debug again.
Thanks a lot for the suggestions. I've managed to work around this by using _aligned_malloc, though I have no idea why that aligns the member variable when __declspec(align(16)) doesn't. Any ideas?
thanks again,
there are no facts, only interpretations
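One possible explanation, which nobody in the thread confirms: an alignment attribute on a member only fixes its offset inside the object. If the object itself comes from a heap allocator that guarantees less than 16 bytes, the member can still land on a misaligned address, and _aligned_malloc fixes exactly that. The sketch below is an illustration under that assumption. `NodeBlock`, `IsAligned16` and `IncrementBytes` are hypothetical names, and it uses the standard `alignas` plus SSE2 integer loads rather than the poster's inline assembly.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical class for illustration; not the poster's class.
struct NodeBlock {
    alignas(16) unsigned char nodes[16];  // standard spelling of align(16)
};

// True when p is on a 16-byte boundary, which aligned moves require.
inline bool IsAligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Adds 1 to each of 16 bytes, choosing the aligned or unaligned
// load/store from the address it actually receives.
inline void IncrementBytes(unsigned char* p)
{
    assert(p != nullptr);
    const __m128i one = _mm_set1_epi8(1);
    if (IsAligned16(p)) {
        // Aligned form: faults if p is not a multiple of 16.
        __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(p));
        _mm_store_si128(reinterpret_cast<__m128i*>(p), _mm_add_epi8(v, one));
    } else {
        // Unaligned form: works for any address.
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(p), _mm_add_epi8(v, one));
    }
}
```

Checking the address at run time, as `IsAligned16` does, is also a quick way to find out whether a crash on an aligned move really is an alignment problem.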
|
|
|
|

|
A really interesting and enlightening article. I have a small question: as I understand it, MMX uses the mm0-mm7 registers, which are actually the CPU's floating-point registers, whereas SSE/SSE2 uses the xmm0-xmm7 registers, which were defined specifically for SIMD purposes. And finally, the question(s):
- Does this mean that I can use both types of registers simultaneously?
- Does this mean that I can do without the
EMMS instruction when writing pure SSE/SSE2 code?
Thanks, I really enjoyed this article.
there are no facts, only interpretations
|
|
|
|

|
AFAIK, the EMMS instruction is needed only with MMX. From Intel's documentation:
The EMMS instruction must be used to clear the MMX™ technology state at the end of all MMX™ technology routines and before calling other procedures or subroutines that may execute floating-point instructions. If a floating-point instruction loads one of the registers in the FPU register stack before the FPU tag word has been reset by the EMMS instruction, a floating-point stack overflow can occur that will result in a floating-point exception or incorrect result.
SSE doesn't require this instruction.
I don't have experience in using SSE2.
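As a rough illustration of the rule quoted above, the sketch below uses MMX intrinsics and then calls `_mm_empty()` (the intrinsic for EMMS) before any floating-point arithmetic runs. The function names are hypothetical, and it assumes a compiler that still exposes `<mmintrin.h>` (for example GCC; 64-bit MSVC does not).

```cpp
#include <cassert>
#include <cstring>
#include <mmintrin.h>  // MMX intrinsics

// Hypothetical helper: adds two groups of four shorts with MMX, then
// clears the MMX state so later floating-point code is safe.
inline void AddFourShortsMMX(const short* a, const short* b, short* out)
{
    assert(a && b && out);
    __m64 va, vb;
    std::memcpy(&va, a, sizeof va);   // copy 8 bytes into each MMX value
    std::memcpy(&vb, b, sizeof vb);
    __m64 sum = _mm_add_pi16(va, vb); // four 16-bit additions at once
    std::memcpy(out, &sum, sizeof sum);
    _mm_empty();                      // EMMS: must come before any x87 code
}

// Floating-point work that runs only after the MMX state was cleared.
inline float AverageOfSums(const short* a, const short* b)
{
    short sums[4];
    AddFourShortsMMX(a, b, sums);
    return (sums[0] + sums[1] + sums[2] + sums[3]) / 4.0f;
}
```

Code that touches only the xmm registers has no such step, which matches the answer above that SSE does not require EMMS.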
|
|
|
|

|
thanks
there are no facts, only interpretations
|
|
|
|

|
Hi
I'm trying to time some code (I'm new to SSE), and when I compile the code below in Release (optimized for speed) with VC++ 2003, the optimizer does some weird things (put a breakpoint at the start of main and you will see).
// SSE.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include
// *** BEGIN OF INCLUDE SECTION 1
// *** INCLUDE THE FOLLOWING DEFINE STATEMENTS FOR MSVC++ 5.0
//#define CPUID __asm __emit 0fh __asm __emit 0a2h
//#define RDTSC __asm __emit 0fh __asm __emit 031h
// *** END OF INCLUDE SECTION 1
#define SIZE 1
// *** BEGIN OF INCLUDE SECTION 2
// *** INCLUDE THE FOLLOWING FUNCTION DECLARATION AND CORRESPONDING
// *** FUNCTION (GIVEN BELOW)
unsigned FindBase();
// *** END OF INCLUDE SECTION 2
int _tmain(int argc, _TCHAR* argv[])
{
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
__m128 Data1, Data2, Res, Data3;
Res.m128_f32[0] = 0.0f;
Res.m128_f32[1] = 0.0f;
Res.m128_f32[2] = 0.0f;
Res.m128_f32[3] = 0.0f;
float Values1[] = { 1.0f, 2.0f, 3.0f, 4.0f };
float Values2[] = { 1.0f, 2.0f, 3.0f, 4.0f };
float Results[] = { 0.0f, 0.0f, 0.0f, 0.0f };
int i;
// *** BEGIN OF INCLUDE SECTION 3
// *** INCLUDE THE FOLLOWING DECLARATIONS IN YOUR CODE
// *** IMMEDIATELY AFTER YOUR DECLARATION SECTION.
unsigned base=0, iterations=0, sum=0;
unsigned cycles_high1=0, cycles_low1=0;
unsigned cycles_high2=0, cycles_low2=0;
unsigned __int64 temp_cycles1=0, temp_cycles2=0;
__int64 total_cycles=0; // Stored signed so it can be converted
// to a double for viewing
double seconds=0.0L;
unsigned mhz=2000; // If you want a seconds count instead
// of just cycles, enter the MHz of your
// machine in this variable.
base=FindBase();
// *** END OF INCLUDE SECTION 3
for (i=0; i
|
|
|
|

|
Did you try this code out with the Intel C++ 8 compiler as well?
It yields much better performance than the 10% you get with VC 7.1: more like 350% faster!
Regards
Lars Schouw
|
|
|
|

|
The only Intel software I have worked with is IPL (Image Processing Library). It is so good that I believe you.
|
|
|
|

|
Can we use SSE to handle double type data? Thanks!
|
|
|
|

|
AFAIK, we cannot do this.
|
|
|
|
|

|
Double-precision floating point requires SSE2. If you run on capable hardware, it works mostly the same as SSE, except:
- Doubles are twice as big, so half as many values fit in each register.
- Where single-precision intrinsics typically end with _ss or _ps (scalar-single or packed-single), double-precision arithmetic intrinsics end with _sd or _pd (scalar-double or packed-double).
- You will find that the compiler often wants __m128d data types, which represent double-precision vectors.
- You'll probably need to #include <emmintrin.h>, the SSE2 header, in addition to xmmintrin.h.
Take a good look at xmmintrin.h and emmintrin.h to get an idea of the operations available.
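To make the list above concrete, here is a small sketch of packed-double arithmetic, assuming SSE2 hardware. `MulAddDoubles` is a hypothetical name; the intrinsics are the `_pd` counterparts of the `_ps` ones used in the article.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2: __m128d and the _pd / _sd intrinsics

// Hypothetical helper: out[i] = a[i] * b[i] + c[i] for n doubles.
void MulAddDoubles(const double* a, const double* b, const double* c,
                   double* out, std::size_t n)
{
    assert(n == 0 || (a && b && c && out));

    // Two doubles fit in one 128-bit register (versus four floats).
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        __m128d vc = _mm_loadu_pd(c + i);
        _mm_storeu_pd(out + i, _mm_add_pd(_mm_mul_pd(va, vb), vc));
    }

    // Scalar tail when n is odd.
    for (; i < n; ++i)
        out[i] = a[i] * b[i] + c[i];
}
```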
|
|
|
|

|
Alex, if you had published this article three months ago, it would have eased my headache over processor optimization.
Good article.
Crystal Silver Codes
vleong@first.net.my
|
|
|
|

|
Does anyone know how to port this code to icc on Linux?
|
|
|
|

|
According to this document:
http://www.tacc.utexas.edu/resources/user_guides/intel/c_ug_lnx.pdf
SSE is supported exactly as in VC 7.1. However, I suggest you ask this question in a non-Visual C++ forum, where Linux and Unix programmers can help you. For example:
http://www.codeguru.com/forum/forumdisplay.php?s=&forumid=9
|
|
|
|

|
Not sure about using the icc, but for gcc you can compile the same code found here by using the -msse switch.
|
|
|
|

|
Wrong: GCC has no __m128 type, and its intrinsic functions are different.
To get the type:
typedef float __m128 __attribute__( ( mode( V4SF ), aligned( 16 ) ) ); // supposedly __m128 must be aligned
To add two __m128 variables a and b:
__m128 c = __builtin_ia32_addps( a, b );
As you can see, you must prefix each SSE instruction with "__builtin_ia32_" to use it, which is definitely not compatible with the MS intrinsic functions.
By the way, y = _mm_set_ps1( x ) is "y = __builtin_ia32_loadss( x ); y = __builtin_ia32_shufps( y, y, 0 );"
/chris
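For comparison, with a GCC recent enough to ship `<xmmintrin.h>` (a later post in this thread reports that 3.3.1 does), the MS-style `_mm_*` names can be used directly instead of the `__builtin_ia32_` builtins. A minimal sketch, assuming such a compiler and the -msse switch where it is not on by default; `AddScalarToFour` is a hypothetical name:

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics under their portable _mm_* names

// Hypothetical helper: out[0..3] = a[0..3] + s.
inline void AddScalarToFour(const float* a, float s, float* out)
{
    assert(a && out);
    __m128 va = _mm_loadu_ps(a);   // load four floats, any alignment
    __m128 vs = _mm_set_ps1(s);    // broadcast s to all four lanes
    _mm_storeu_ps(out, _mm_add_ps(va, vs));
}
```

The same source should then build unchanged with the Microsoft compiler, which is the main reason to prefer the header over the raw builtins.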
|
|
|
|

|
Actually I do have an __m128 data type for floating-point operations. I am using gcc 3.3.1. I have learned a couple of tricks since my last post here that others might find useful. I am typing this code from memory, so it may not be perfect. The basic idea is to use a union that holds both the __m128 data type and a float[4] array. Remember that all members of a union occupy the same memory address, and which type you get depends on the member you access.
union myDataType {
__m128 vec;
float arr[4] __attribute__((aligned(16)));
};
Now the data can be accessed interchangeably through both members. Given a variable d of type myDataType, use d.vec to access the __m128 and d.arr to access the floats.
Good luck to everyone out there. BTW, take what I say with a grain of salt. I don't claim to be a guru, just someone who loves to code and figure stuff out.
Thanks for the great discussion and great site.
-gnuLNX
|
|
|
|

|
Hmm... I'm using GCC 3.2.3 but I don't have __m128. Are you sure you don't need to include a header file?
I strongly discourage using a union, because it turns off some optimizations (I have tried a lot of tricks with the SSE builtins).
For those who are interested in real SSE optimization in C++:
////////////////////////////////////////////////////////////////////////////////
#define always_inline inline __attribute__( ( always_inline ) )
////////////////////////////////////////////////////////////////////////////////
typedef float __v4sf __attribute__( ( mode( V4SF ), aligned( 16 ) ) );
////////////////////////////////////////////////////////////////////////////////
struct v4sf
{
__v4sf v;
///
always_inline
v4sf( ) { }
always_inline
v4sf( __v4sf _1 ) : v( _1 ) { }
always_inline
operator __v4sf( ) const { return v; }
};
always_inline
v4sf operator +( v4sf _1, v4sf _2 )
{ return __builtin_ia32_addps( _1.v, _2.v ); }
always_inline
v4sf operator +( __v4sf _1, v4sf _2 )
{ return __builtin_ia32_addps( _1, _2.v ); }
always_inline
v4sf operator +( v4sf _1, __v4sf _2 )
{ return __builtin_ia32_addps( _1.v, _2 ); }
always_inline
v4sf operator -( v4sf _1, v4sf _2 )
{ return __builtin_ia32_subps( _1.v, _2.v ); }
always_inline
v4sf operator -( __v4sf _1, v4sf _2 )
{ return __builtin_ia32_subps( _1, _2.v ); }
always_inline
v4sf operator -( v4sf _1, __v4sf _2 )
{ return __builtin_ia32_subps( _1.v, _2 ); }
always_inline
v4sf operator *( v4sf _1, v4sf _2 )
{ return __builtin_ia32_mulps( _1.v, _2.v ); }
always_inline
v4sf operator *( __v4sf _1, v4sf _2 )
{ return __builtin_ia32_mulps( _1, _2.v ); }
always_inline
v4sf operator *( v4sf _1, __v4sf _2 )
{ return __builtin_ia32_mulps( _1.v, _2 ); }
always_inline
v4sf operator /( v4sf _1, v4sf _2 )
{ return __builtin_ia32_divps( _1.v, _2.v ); }
always_inline
v4sf operator /( __v4sf _1, v4sf _2 )
{ return __builtin_ia32_divps( _1, _2.v ); }
always_inline
v4sf operator /( v4sf _1, __v4sf _2 )
{ return __builtin_ia32_divps( _1.v, _2 ); }
////////////////////////////////////////////////////////////////////////////////
Using "struct v4sf" helps the compiler keep v in an SSE register instead of on the stack. A union prevents that register optimization and puts v on the stack even where an SSE register would be more appropriate.
v4sf a,b,d;
void f()
{
d = a * ( d + b );
}
that gives us :
65: 0f 28 3d 20 00 00 00 movaps 0x20,%xmm7
6c: 0f 28 35 00 00 00 00 movaps 0x0,%xmm6
73: 0f 58 3d 10 00 00 00 addps 0x10,%xmm7
7a: 0f 59 f7 mulps %xmm7,%xmm6
7d: 0f 29 35 20 00 00 00 movaps %xmm6,0x20
Now if we replace :
struct v4sf
{
union { __v4sf v; float f[4]; }
...
that gives us (what ugly code!):
65: a1 00 00 00 00 mov 0x0,%eax
6a: 89 45 d8 mov %eax,0xffffffd8(%ebp)
6d: a1 04 00 00 00 mov 0x4,%eax
72: 89 45 dc mov %eax,0xffffffdc(%ebp)
75: a1 08 00 00 00 mov 0x8,%eax
7a: 89 45 e0 mov %eax,0xffffffe0(%ebp)
7d: a1 0c 00 00 00 mov 0xc,%eax
82: 89 45 e4 mov %eax,0xffffffe4(%ebp)
85: 0f 28 75 d8 movaps 0xffffffd8(%ebp),%xmm6
89: a1 20 00 00 00 mov 0x20,%eax
8e: 89 45 b8 mov %eax,0xffffffb8(%ebp)
91: a1 24 00 00 00 mov 0x24,%eax
96: 89 45 bc mov %eax,0xffffffbc(%ebp)
99: a1 28 00 00 00 mov 0x28,%eax
9e: 89 45 c0 mov %eax,0xffffffc0(%ebp)
a1: a1 2c 00 00 00 mov 0x2c,%eax
a6: 89 45 c4 mov %eax,0xffffffc4(%ebp)
a9: 0f 28 7d b8 movaps 0xffffffb8(%ebp),%xmm7
ad: a1 10 00 00 00 mov 0x10,%eax
b2: 89 45 a8 mov %eax,0xffffffa8(%ebp)
b5: a1 14 00 00 00 mov 0x14,%eax
ba: 89 45 ac mov %eax,0xffffffac(%ebp)
bd: a1 18 00 00 00 mov 0x18,%eax
c2: 89 45 b0 mov %eax,0xffffffb0(%ebp)
c5: a1 1c 00 00 00 mov 0x1c,%eax
ca: 89 45 b4 mov %eax,0xffffffb4(%ebp)
cd: 0f 58 7d a8 addps 0xffffffa8(%ebp),%xmm7
d1: 0f 29 7d c8 movaps %xmm7,0xffffffc8(%ebp)
d5: 0f 59 75 c8 mulps 0xffffffc8(%ebp),%xmm6
d9: 0f 29 75 e8 movaps %xmm6,0xffffffe8(%ebp)
dd: 8b 45 e8 mov 0xffffffe8(%ebp),%eax
e0: a3 20 00 00 00 mov %eax,0x20
e5: 8b 45 ec mov 0xffffffec(%ebp),%eax
e8: a3 24 00 00 00 mov %eax,0x24
ed: 8b 45 f0 mov 0xfffffff0(%ebp),%eax
f0: a3 28 00 00 00 mov %eax,0x28
f5: 8b 45 f4 mov 0xfffffff4(%ebp),%eax
f8: a3 2c 00 00 00 mov %eax,0x2c
So you shouldn't mix a vector and a float array like this.
Even these variants:
struct v4sf
{
union { __v4sf v; float f[4] __attribute( ( aligned( 16 ) ) ); }
...
or :
struct v4sf
{
union { __v4sf v; float f[4]; } __attribute( ( aligned( 16 ) ) );
...
don't change anything.
To access the four floats, just create another class, float4, with conversion operators between v4sf and float4.
Oh yeah, the flags were:
-march=athlon-xp
-fomit-frame-pointer
-mfpmath=sse
-O6
|
|
|
|

|
First off, let me say that I enjoyed looking over your code; you have done some nice things. I have a couple of answers for you, as well as a couple of questions to help me better understand what is going on.
You are totally correct: the 3.2 tree does not have the __m128 type. However, the 3.3.1 tree does. Here is a small piece of its xmmintrin.h header file. I switched to 3.3.1 for this very reason. Also, the only reason I ever used a union was so that I could easily read and write the xmm data to main memory. That may or may not be a good idea; I still want to run some benchmarks against your very cool code!
static __inline __m128
_mm_add_ss (__m128 __A, __m128 __B)
{
return (__m128) __builtin_ia32_addss ((__v4sf)__A, (__v4sf)__B);
}
static __inline __m128
_mm_sub_ss (__m128 __A, __m128 __B)
{
return (__m128) __builtin_ia32_subss ((__v4sf)__A, (__v4sf)__B);
}
static __inline __m128
_mm_mul_ss (__m128 __A, __m128 __B)
{
return (__m128) __builtin_ia32_mulss ((__v4sf)__A, (__v4sf)__B);
}
static __inline __m128
_mm_div_ss (__m128 __A, __m128 __B)
{
return (__m128) __builtin_ia32_divss ((__v4sf)__A, (__v4sf)__B);
}
static __inline __m128
_mm_sqrt_ss (__m128 __A)
{
return (__m128) __builtin_ia32_sqrtss ((__v4sf)__A);
}
static __inline __m128
_mm_rcp_ss (__m128 __A)
{
return (__m128) __builtin_ia32_rcpss ((__v4sf)__A);
}
Now down to the business of structures vs. unions.
I assume that you used the structure to build your __v4sf data type, correct? Since this data type is included in 3.3.1, do we still need to take your route? Again, I am certainly no guru, but I am definitely getting some decent speed gains in my code. Now back to something in your assembly dump that caught my attention. Your code seems to use more registers than just xmm0, but when I compile your code in the same manner, omitting the athlon switch, I still only use register xmm0. In fact, no matter how I compile it, I still only use one xmm register. Do you have any thoughts on this?
Also, since the 3.3.1 tree does have the __m128 data type, would you be willing to write a similar structure that overloads * and + and so on?
Thanks again for your valuable info, and sorry up front if I sound like I have no clue... probably because I really don't!
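A rough sketch of what such a wrapper might look like on top of `__m128` and `<xmmintrin.h>`. This is not the other poster's code: `Vec4` and its members are hypothetical names, and whether the compiler keeps the values in registers as well as with the `__v4sf` version above would have to be checked in the generated assembly.

```cpp
#include <cassert>
#include <xmmintrin.h>  // __m128 and the _mm_*_ps intrinsics

// Hypothetical wrapper: the struct holds only the vector, with no float
// array inside, so nothing forces the value into memory.
struct Vec4 {
    __m128 v;

    Vec4() : v(_mm_setzero_ps()) {}
    Vec4(__m128 x) : v(x) {}
    explicit Vec4(const float* p) : v(_mm_loadu_ps(p)) {}  // unaligned load

    // Copies the four lanes out to memory, replacing the union trick.
    void store(float* p) const { assert(p != nullptr); _mm_storeu_ps(p, v); }
};

inline Vec4 operator+(Vec4 a, Vec4 b) { return Vec4(_mm_add_ps(a.v, b.v)); }
inline Vec4 operator-(Vec4 a, Vec4 b) { return Vec4(_mm_sub_ps(a.v, b.v)); }
inline Vec4 operator*(Vec4 a, Vec4 b) { return Vec4(_mm_mul_ps(a.v, b.v)); }
inline Vec4 operator/(Vec4 a, Vec4 b) { return Vec4(_mm_div_ps(a.v, b.v)); }
```

With this, the expression from the earlier post reads the same way: `d = a * (d + b);`, followed by `d.store(out)` when the floats are needed.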
|
|
|
|

|
I'm using the latest release of Dev-C++ with GCC 3.2.3, because I dislike VC 6.0/7.0, which are not compliant with ISO C++; I'm a C++ guru and like to use templates in unusual ways. So compatibility with xmmintrin.h is not something I care about. I used to be an assembly coder, and I find GCC's inline asm statements the best I have ever seen: you can let the compiler choose which registers to allocate, so that global optimizations can still happen. That is not something you can really do with VC 6.0/7.0. But I admit that having xmmintrin.h helps to reuse existing code that depends on it.
My warning against the union comes from this: you must not have an array of float in the struct if you want the compiler to keep your float vectors in a register instead of in stack slots, because an array implies memory slots.
Why use a struct? It is in fact the only way to get operator overloading. But I must admit that the code generated this way is not always very good, especially for complex expressions.
I tried the same code with -march=pentium3 (the minimum for SSE) instead of -march=athlon-xp and got the same result.
Here are the flags I used:
-march=pentium3 (SSE only) / -march=pentium4 (SSE,SSE2) / -march=athlon-xp (SSE, 3Dnow!, Ext3DNow! )
-fomit-frame-pointer
-mfpmath=sse
-O6
-fssa
-fssa-dce
-fssa-ccp
-fprefetch-loop-arrays
Otherwise you may need to add -msse (apparently I don't need it).
How do you fill a struct v4sf from an array of float?
////////////////////////////////////////////////////////////////////////////////
typedef float __v4sf __attribute__( ( mode( V4SF ), aligned( 16 ) ) );
////////////////////////////////////////////////////////////////////////////////
struct v4sf
{
__v4sf v;
///
always_inline
v4sf( ) { }
always_inline
v4sf( __v4sf _1 )
: v( _1 ) { }
always_inline
v4sf( float const *_1 )
: v( __builtin_ia32_loadups( ( float * )_1 ) ) { }
always_inline
operator __v4sf( ) const { return v; }
};
struct float4
{
union { float v[4]; };
///
always_inline
float4( ) { }
always_inline
float4( __v4sf _1 )
{ __builtin_ia32_storeups( v, _1 ); }
always_inline
operator __v4sf( ) const { return v4sf( v ); }
};
...
float const f[4] = { 1.0, -1.0, 1.0, -1.0 };
__v4sf compute( __v4sf a, __v4sf b, __v4sf c )
{
return a * ( v4sf( b ) + c ) - f;
}
gives us :
000000a0 <__Z7computeU8__vectorfS_S_>:
a0: 0f 28 54 24 14 movaps 0x14(%esp,1),%xmm2
a5: 0f 28 44 24 04 movaps 0x4(%esp,1),%xmm0
aa: 0f 58 54 24 24 addps 0x24(%esp,1),%xmm2
af: 0f 59 c2 mulps %xmm2,%xmm0
b2: 0f 10 15 04 01 00 00 movups 0x104,%xmm2
b9: 0f 5c c2 subps %xmm2,%xmm0
bc: c3 ret
My mail address is paul.suade@laposte.net.
|
|
|
|

|
Hey Paul, thanks a lot. You have given me a lot to think about. I should mention that I am using gcc on Linux, not on Windows. BTW, does it integrate well with Dev-C++? I think I am going to be doing some Windows coding in the near future.
|
|
|
|
 |
|
|
An article describes programming floating-point calculations using Streaming SIMD Extensions
| Type | Article |
| Licence | CPOL |
| First Posted | 10 Jul 2003 |
| Views | 367,757 |
| Bookmarked | 109 times |
|
|