|
|
Comments and Discussions
|
|
 |

|
Alex Farber wrote:
What is ARRAY_SIZE value?
it's an int, value=16
Alex Farber wrote:
Why variable type is unsigned char and not float?
I want to use SSE2 for SIMD operations on 16 bytes
Alex Farber wrote:
What exception exactly do you have?
SEHException. But it occurs with movaps and not with movups.
Alex Farber wrote:
Take a look at Assembly code generated by C++ compiler from movaps and movups.
I was hoping to generate the assembly code myself (working with inline assembly), but I'll debug again.
Thanks a lot for the suggestions. I've managed to work around this by using _aligned_malloc, though I have no idea why this aligns member variables and __declspec(align(16)) doesn't. Any ideas?
thanks again,
there are no facts, only interpretations
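A possible explanation, offered as a sketch rather than a diagnosis of the original code: movaps faults on any address that is not a multiple of 16, while movups accepts any address. An alignment attribute on a member fixes its offset inside the object, so the member is only aligned in memory if the object itself starts on a 16-byte boundary, which an ordinary heap allocation may not guarantee. The example below uses _mm_malloc/_mm_free as a portable stand-in for _aligned_malloc, with SSE2 byte addition on 16 unsigned chars; the function names are illustrative only.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (pulls in xmmintrin.h, which declares _mm_malloc/_mm_free)
#include <cstdint>

// True when p may be used with aligned loads/stores (movaps/movdqa).
bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Allocates 16 bytes on a 16-byte boundary; release the block with _mm_free.
unsigned char* alloc_block16() {
    return static_cast<unsigned char*>(_mm_malloc(16, 16));
}

// Adds two 16-byte blocks lane by lane (wrapping on overflow).
// All three pointers must be 16-byte aligned, otherwise the access faults.
void add_bytes_aligned(const unsigned char* a, const unsigned char* b,
                       unsigned char* out) {
    __m128i va = _mm_load_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_load_si128(reinterpret_cast<const __m128i*>(b));
    _mm_store_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi8(va, vb));
}

// Same operation with unaligned loads/stores: valid for any address.
void add_bytes_unaligned(const unsigned char* a, const unsigned char* b,
                         unsigned char* out) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi8(va, vb));
}
```

If the buffers come from alloc_block16, the aligned version is safe; for pointers of unknown origin, checking is_aligned16 first or falling back to the unaligned version avoids the exception.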
|
|
|
|

|
A really interesting and enlightening article. I have a small question: as I understand it, MMX uses the mm0-mm7 registers, which are actually the CPU floating-point registers, whereas SSE/SSE2 uses the xmm0-xmm7 registers, which were defined especially for SIMD purposes. And finally, the question(s):
- Does this mean that I can use both types of registers simultaneously?
- Does this mean that I can do without the EMMS instruction when writing pure SSE/SSE2 code?
Thanks, I really enjoyed this article.
there are no facts, only interpretations
|
|
|
|

|
AFAIK, EMMS instruction must be used only with MMX:
The EMMS instruction must be used to clear the MMX™ technology state at the end of all MMX™ technology routines and before calling other procedures or subroutines that may execute floating-point instructions. If a floating-point instruction loads one of the registers in the FPU register stack before the FPU tag word has been reset by the EMMS instruction, a floating-point stack overflow can occur that will result in a floating-point exception or incorrect result.
SSE doesn't require this instruction.
I don't have experience in using SSE2.
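A minimal sketch of the rule quoted above, assuming a compiler and target where MMX intrinsics are available (for example a 32-bit build): the MMX routine ends with _mm_empty(), the intrinsic for EMMS, so that floating-point code called afterwards finds a clean FPU state. The function name add4_mmx is illustrative only.

```cpp
#include <mmintrin.h>  // MMX intrinsics
#include <cstring>

// Adds four 16-bit integers with MMX, then clears the MMX state.
void add4_mmx(const short* a, const short* b, short* out) {
    __m64 va, vb;
    std::memcpy(&va, a, sizeof va);   // load 4 x 16-bit values
    std::memcpy(&vb, b, sizeof vb);
    __m64 sum = _mm_add_pi16(va, vb); // packed 16-bit add
    std::memcpy(out, &sum, sizeof sum);
    _mm_empty();                      // EMMS: required before later floating-point code
}
```

A routine that uses only the xmm registers through xmmintrin.h has no such epilogue, which matches the answer above.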
|
|
|
|

|
thanks
there are no facts, only interpretations
|
|
|
|

|
Hi
I'm trying to measure some code (I'm a beginner with SSE), and when compiling the code below in Release (optimized for speed) in VC++ 2003, the optimizer does some weird things (put a breakpoint at the start of main and you will see).
// SSE.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include
// *** BEGIN OF INCLUDE SECTION 1
// *** INCLUDE THE FOLLOWING DEFINE STATEMENTS FOR MSVC++ 5.0
//#define CPUID __asm __emit 0fh __asm __emit 0a2h
//#define RDTSC __asm __emit 0fh __asm __emit 031h
// *** END OF INCLUDE SECTION 1
#define SIZE 1
// *** BEGIN OF INCLUDE SECTION 2
// *** INCLUDE THE FOLLOWING FUNCTION DECLARATION AND CORRESPONDING
// *** FUNCTION (GIVEN BELOW)
unsigned FindBase();
// *** END OF INCLUDE SECTION 2
int _tmain(int argc, _TCHAR* argv[])
{
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
__m128 Data1, Data2, Res, Data3;
Res.m128_f32[0] = 0.0f;
Res.m128_f32[1] = 0.0f;
Res.m128_f32[2] = 0.0f;
Res.m128_f32[3] = 0.0f;
float Values1[] = { 1.0f, 2.0f, 3.0f, 4.0f };
float Values2[] = { 1.0f, 2.0f, 3.0f, 4.0f };
float Results[] = { 0.0f, 0.0f, 0.0f, 0.0f };
int i;
// *** BEGIN OF INCLUDE SECTION 3
// *** INCLUDE THE FOLLOWING DECLARATIONS IN YOUR CODE
// *** IMMEDIATELY AFTER YOUR DECLARATION SECTION.
unsigned base=0, iterations=0, sum=0;
unsigned cycles_high1=0, cycles_low1=0;
unsigned cycles_high2=0, cycles_low2=0;
unsigned __int64 temp_cycles1=0, temp_cycles2=0;
__int64 total_cycles=0; // Stored signed so it can be converted
// to a double for viewing
double seconds=0.0L;
unsigned mhz=2000; // If you want a seconds count instead
// of just cycles, enter the MHz of your
// machine in this variable.
base=FindBase();
// *** END OF INCLUDE SECTION 3
for (i=0; i
|
|
|
|

|
Did you try this code out with the Intel C++ 8 compiler as well?
It yields much better performance than the 10% you will get with VC 7.1, more like 350% faster!
Regards
Lars Schouw
|
|
|
|

|
The only Intel product I have worked with is IPL (Image Processing Library). It is so good that I believe you.
|
|
|
|

|
Can we use SSE to handle double type data? Thanks!
|
|
|
|

|
AFAIK, we cannot do this.
|
|
|
|
|

|
Double-precision floating point requires SSE2. If you run on capable hardware, it works mostly the same as SSE, except:
- Doubles are twice as big, so half as many values fit in each register.
- Where single-precision intrinsics typically end with _ss or _ps (scalar-single or packed-single), double-precision arithmetic intrinsics end with _sd or _pd (scalar-double or packed-double).
- You will find that the compiler often wants __m128d data types, which represent double-precision vectors.
- You'll need to #include <emmintrin.h>, the SSE2 header (<xmmintrin.h> covers only SSE).
Take a good look at intrin.h, xmmintrin.h and emmintrin.h to get a good idea of the operations available.
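The points above can be sketched as follows. This is a minimal illustration, assuming SSE2-capable hardware; the SSE2 intrinsics are declared in emmintrin.h, and the function names add_doubles and add_low are made up for the example.

```cpp
#include <emmintrin.h>  // SSE2: __m128d and the _pd / _sd intrinsics

// Adds n doubles, two per register (packed-double, suffix _pd).
void add_doubles(const double* a, const double* b, double* out, int n) {
    int i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);  // unaligned load of 2 doubles
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));
    }
    for (; i < n; ++i) {                   // scalar tail for odd n
        out[i] = a[i] + b[i];
    }
}

// Scalar-double variant (suffix _sd): only the low element is added.
double add_low(double x, double y) {
    return _mm_cvtsd_f64(_mm_add_sd(_mm_set_sd(x), _mm_set_sd(y)));
}
```

Compared with the single-precision code, only the data type (__m128d instead of __m128), the suffixes and the element count per register change.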
|
|
|
|

|
Alex, if you had published this article 3 months ago, it would have eased my headache over processor optimization.
Good article.
Crystal Silver Codes
vleong@first.net.my
|
|
|
|

|
Does anyone know how to convert the prewritten code for icc on Linux?
|
|
|
|

|
According to this document:
http://www.tacc.utexas.edu/resources/user_guides/intel/c_ug_lnx.pdf
SSE is supported exactly as in VC 7.1. However, I suggest you ask this question in a non-Visual C++ forum, where Linux and Unix programmers can help you. For example:
http://www.codeguru.com/forum/forumdisplay.php?s=&forumid=9
|
|
|
|

|
Not sure about using the icc, but for gcc you can compile the same code found here by using the -msse switch.
|
|
|
|

|
Wrong: you do not have __m128 in GNU, and the intrinsic functions are different.
To get it:
typedef float __m128 __attribute__( ( mode( V4SF ), aligned( 16 ) ) ); // supposedly __m128 must be aligned
To add two __m128 variables a and b:
__m128 c = __builtin_ia32_addps( a, b );
As you can see, you must prefix your SSE instruction with "__builtin_ia32_" to execute it, which is definitely not compatible with the MS intrinsic functions.
By the way, y = _mm_set_ps1( x ) is "y = __builtin_ia32_loadss( &x ); y = __builtin_ia32_shufps( y, y, 0 );"
/chris
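For comparison, here is a sketch of the same two operations written with the xmmintrin.h names instead of the __builtin_ia32_ spellings. It assumes a compiler that ships that header (the replies in this thread report that GCC 3.3.1 does); the function names add4 and splat4 are illustrative only.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, _mm_set_ps1, ...

// out[i] = a[i] + b[i] for four floats (addps).
void add4(const float* a, const float* b, float* out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

// Broadcasts x into all four lanes, the operation described above
// as a scalar load followed by a shuffle.
void splat4(float x, float* out) {
    _mm_storeu_ps(out, _mm_set_ps1(x));
}
```

With GCC this kind of code is typically built with -msse, as mentioned earlier in the thread.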
|
|
|
|

|
Actually I do have an __m128 data type for floating-point operations. I am using gcc 3.3.1. I have learned a couple of tricks since my last post here that others might find useful. I am typing this code from memory, so it may not be perfect. The basic idea is to use a union to hold both the __m128 data type and a float[4] data type. Remember that the members of a union all occupy the same memory address, and the type is selected by which member you use.
union myDataType {
__m128 vec;
float arr[4] __attribute__( ( aligned( 16 ) ) );
};
Now the data can be stored interchangeably in both. To access the __m128 data use myDataType.vec, and to access the array use myDataType.arr.
Good luck to everyone out there. BTW take what I say with a grain of salt. I claim to be no guru just someone who loves to code and figure stuff out.
Thanks for the great discussion and great site.
-gnuLNX
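A small sketch of how the union described above might be used, assuming xmmintrin.h is available. Reading a union member other than the one last written relies on compiler support (GCC documents it as an extension), so treat this as an illustration of the idea and not as portable C++. The names Vec4 and scale are made up for the example.

```cpp
#include <xmmintrin.h>

// The __m128 member already forces 16-byte alignment of the whole union.
union Vec4 {
    __m128 vec;     // view used by the SSE intrinsics
    float  arr[4];  // view used for element-wise access
};

// Multiplies all four elements by k using one mulps.
Vec4 scale(const Vec4& in, float k) {
    Vec4 out;
    out.vec = _mm_mul_ps(in.vec, _mm_set1_ps(k));
    return out;
}
```

Values are written through arr, processed through vec, and read back through arr.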
|
|
|
|

|
Hmm... I'm using GCC 3.2.3 but fail to have __m128. Are you sure you don't need to include a header file?
I strongly discourage you from using a union, because it turns off some optimizations (I tried a lot of tricks with SSE builtins).
For those who are interested in real SSE optimization in C++:
////////////////////////////////////////////////////////////////////////////////
#define always_inline inline __attribute__( ( always_inline ) )
////////////////////////////////////////////////////////////////////////////////
typedef float __v4sf __attribute__( ( mode( V4SF ), aligned( 16 ) ) );
////////////////////////////////////////////////////////////////////////////////
struct v4sf
{
__v4sf v;
///
always_inline
v4sf( ) { }
always_inline
v4sf( __v4sf _1 ) : v( _1 ) { }
always_inline
operator __v4sf( ) const { return v; }
};
always_inline
v4sf operator +( v4sf _1, v4sf _2 )
{ return __builtin_ia32_addps( _1.v, _2.v ); }
always_inline
v4sf operator +( __v4sf _1, v4sf _2 )
{ return __builtin_ia32_addps( _1, _2.v ); }
always_inline
v4sf operator +( v4sf _1, __v4sf _2 )
{ return __builtin_ia32_addps( _1.v, _2 ); }
always_inline
v4sf operator -( v4sf _1, v4sf _2 )
{ return __builtin_ia32_subps( _1.v, _2.v ); }
always_inline
v4sf operator -( __v4sf _1, v4sf _2 )
{ return __builtin_ia32_subps( _1, _2.v ); }
always_inline
v4sf operator -( v4sf _1, __v4sf _2 )
{ return __builtin_ia32_subps( _1.v, _2 ); }
always_inline
v4sf operator *( v4sf _1, v4sf _2 )
{ return __builtin_ia32_mulps( _1.v, _2.v ); }
always_inline
v4sf operator *( __v4sf _1, v4sf _2 )
{ return __builtin_ia32_mulps( _1, _2.v ); }
always_inline
v4sf operator *( v4sf _1, __v4sf _2 )
{ return __builtin_ia32_mulps( _1.v, _2 ); }
always_inline
v4sf operator /( v4sf _1, v4sf _2 )
{ return __builtin_ia32_divps( _1.v, _2.v ); }
always_inline
v4sf operator /( __v4sf _1, v4sf _2 )
{ return __builtin_ia32_divps( _1, _2.v ); }
always_inline
v4sf operator /( v4sf _1, __v4sf _2 )
{ return __builtin_ia32_divps( _1.v, _2 ); }
////////////////////////////////////////////////////////////////////////////////
Now using "struct v4sf" helps the compiler allocate SSE registers without putting v on the stack. Using a union would prevent the compiler from applying register optimizations and would put v on the stack even where an SSE register would be more appropriate.
v4sf a,b,d;
void f()
{
d = a * ( d + b );
}
that gives us :
65: 0f 28 3d 20 00 00 00 movaps 0x20,%xmm7
6c: 0f 28 35 00 00 00 00 movaps 0x0,%xmm6
73: 0f 58 3d 10 00 00 00 addps 0x10,%xmm7
7a: 0f 59 f7 mulps %xmm7,%xmm6
7d: 0f 29 35 20 00 00 00 movaps %xmm6,0x20
Now if we replace:
struct v4sf
{
union { __v4sf v; float f[4]; };
...
that gives us (what ugly code!):
65: a1 00 00 00 00 mov 0x0,%eax
6a: 89 45 d8 mov %eax,0xffffffd8(%ebp)
6d: a1 04 00 00 00 mov 0x4,%eax
72: 89 45 dc mov %eax,0xffffffdc(%ebp)
75: a1 08 00 00 00 mov 0x8,%eax
7a: 89 45 e0 mov %eax,0xffffffe0(%ebp)
7d: a1 0c 00 00 00 mov 0xc,%eax
82: 89 45 e4 mov %eax,0xffffffe4(%ebp)
85: 0f 28 75 d8 movaps 0xffffffd8(%ebp),%xmm6
89: a1 20 00 00 00 mov 0x20,%eax
8e: 89 45 b8 mov %eax,0xffffffb8(%ebp)
91: a1 24 00 00 00 mov 0x24,%eax
96: 89 45 bc mov %eax,0xffffffbc(%ebp)
99: a1 28 00 00 00 mov 0x28,%eax
9e: 89 45 c0 mov %eax,0xffffffc0(%ebp)
a1: a1 2c 00 00 00 mov 0x2c,%eax
a6: 89 45 c4 mov %eax,0xffffffc4(%ebp)
a9: 0f 28 7d b8 movaps 0xffffffb8(%ebp),%xmm7
ad: a1 10 00 00 00 mov 0x10,%eax
b2: 89 45 a8 mov %eax,0xffffffa8(%ebp)
b5: a1 14 00 00 00 mov 0x14,%eax
ba: 89 45 ac mov %eax,0xffffffac(%ebp)
bd: a1 18 00 00 00 mov 0x18,%eax
c2: 89 45 b0 mov %eax,0xffffffb0(%ebp)
c5: a1 1c 00 00 00 mov 0x1c,%eax
ca: 89 45 b4 mov %eax,0xffffffb4(%ebp)
cd: 0f 58 7d a8 addps 0xffffffa8(%ebp),%xmm7
d1: 0f 29 7d c8 movaps %xmm7,0xffffffc8(%ebp)
d5: 0f 59 75 c8 mulps 0xffffffc8(%ebp),%xmm6
d9: 0f 29 75 e8 movaps %xmm6,0xffffffe8(%ebp)
dd: 8b 45 e8 mov 0xffffffe8(%ebp),%eax
e0: a3 20 00 00 00 mov %eax,0x20
e5: 8b 45 ec mov 0xffffffec(%ebp),%eax
e8: a3 24 00 00 00 mov %eax,0x24
ed: 8b 45 f0 mov 0xfffffff0(%ebp),%eax
f0: a3 28 00 00 00 mov %eax,0x28
f5: 8b 45 f4 mov 0xfffffff4(%ebp),%eax
f8: a3 2c 00 00 00 mov %eax,0x2c
So you shouldn't mix things like that.
Even:
struct v4sf
{
union { __v4sf v; float f[4] __attribute__( ( aligned( 16 ) ) ); };
...
or:
struct v4sf
{
union { __v4sf v; float f[4]; } __attribute__( ( aligned( 16 ) ) );
...
doesn't change anything.
To access the 4 floats, just create another class float4 with a conversion operator between v4sf and float4.
Oh yeah, flags were :
-march=athlon-xp
-fomit-frame-pointer
-mfpmath=sse
-O6
|
|
|
|

|
First off, let me say that I enjoyed looking over your code... some nice things you have done. I have a couple of answers for you, as well as a couple of questions to help me better understand what is going on.
You are totally correct: the 3.2 tree does not have the __m128 intrinsic type. However, the 3.3.1 tree does. Here is a small piece from the xmmintrin.h header file. I switched to 3.3.1 for this very reason. Also, the only reason I ever used a union was so that I could easily read and write the xmm data to main memory. That may or may not be a good idea... I still want to do some benchmarks against your very cool code!
static __inline __m128
_mm_add_ss (__m128 __A, __m128 __B)
{
return (__m128) __builtin_ia32_addss ((__v4sf)__A, (__v4sf)__B);
}
static __inline __m128
_mm_sub_ss (__m128 __A, __m128 __B)
{
return (__m128) __builtin_ia32_subss ((__v4sf)__A, (__v4sf)__B);
}
static __inline __m128
_mm_mul_ss (__m128 __A, __m128 __B)
{
return (__m128) __builtin_ia32_mulss ((__v4sf)__A, (__v4sf)__B);
}
static __inline __m128
_mm_div_ss (__m128 __A, __m128 __B)
{
return (__m128) __builtin_ia32_divss ((__v4sf)__A, (__v4sf)__B);
}
static __inline __m128
_mm_sqrt_ss (__m128 __A)
{
return (__m128) __builtin_ia32_sqrtss ((__v4sf)__A);
}
static __inline __m128
_mm_rcp_ss (__m128 __A)
{
return (__m128) __builtin_ia32_rcpss ((__v4sf)__A);
}
Now down to the business of structures vs. unions.
I assume that you used the structure to build your __v4sf data type, correct? Since this data type is included in 3.3.1, do we still need to take your route? Again, I am certainly no guru, but I am definitely getting some decent speed gains in my code. Now back to something in your assembly dump that caught my attention. Your code seems to make use of more than just the xmm0 register, but when I compile your code in the same manner, omitting the athlon switch, I still only use register xmm0. In fact, no matter how I compile it, I still only use one xmm register. Do you have any thoughts on this?
Also, since the 3.3.1 tree does have the __m128 data type, would you be willing to do a similar structure overloading * and + and such?
Thanks again for your valuable info, and sorry up front if I sound like I have no clue... probably because I really don't!
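As a rough idea of what such a structure could look like on top of __m128 and the xmmintrin.h intrinsics, here is a minimal sketch. It is not the code from the earlier post: the name vec4, the load/store helpers and the choice to omit an implicit conversion back to __m128 are all assumptions made for this example.

```cpp
#include <xmmintrin.h>

// Thin wrapper around one SSE register with no array member,
// so the compiler is free to keep the value in an xmm register.
struct vec4 {
    __m128 v;

    vec4() : v(_mm_setzero_ps()) {}
    vec4(__m128 x) : v(x) {}
    explicit vec4(const float* p) : v(_mm_loadu_ps(p)) {}  // unaligned load

    void store(float* p) const { _mm_storeu_ps(p, v); }    // unaligned store
};

inline vec4 operator+(vec4 a, vec4 b) { return vec4(_mm_add_ps(a.v, b.v)); }
inline vec4 operator-(vec4 a, vec4 b) { return vec4(_mm_sub_ps(a.v, b.v)); }
inline vec4 operator*(vec4 a, vec4 b) { return vec4(_mm_mul_ps(a.v, b.v)); }
inline vec4 operator/(vec4 a, vec4 b) { return vec4(_mm_div_ps(a.v, b.v)); }
```

With this in place, an expression such as d = a * (d + b) can be written directly on vec4 values, and store() copies the result back into a plain float array.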
|
|
|
|

|
I'm using the latest release of Dev-C++ with GCC 3.2.3, because I dislike VC 6.0/7.0, which are not compliant with ISO C++, especially because I'm a C++ guru and like to use templates in unusual ways. So compatibility with xmmintrin.h is not something I care about. Formerly I was an assembly coder, and I found GCC's inline asm statements to be the best I have ever seen: you can let the compiler choose which registers to allocate or use, so that global optimizations can still happen. That is not something you can really do with VC 6.0/7.0. But I admit that having xmmintrin.h helps us reuse existing code that uses it.
My warning against using a union comes from the fact that you mustn't have an array of float in a struct if you want your compiler to keep your float vectors in a register instead of memory slots on the stack, because an array means memory slots.
Why use a struct? Well, it is in fact the only way to use operators. But I must admit that code generated that way is not always very good, especially with complex expressions.
I tried the same code with -march=pentium3 (the minimum for SSE) instead of -march=athlon-xp and got the same result.
Here are the flags I added:
-march=pentium3 (SSE only) / -march=pentium4 (SSE,SSE2) / -march=athlon-xp (SSE, 3Dnow!, Ext3DNow! )
-fomit-frame-pointer
-mfpmath=sse
-O6
-fssa
-fssa-dce
-fssa-ccp
-fprefetch-loop-arrays
Otherwise you may need to add -msse (apparently I don't need it).
How do you feed a struct v4sf with an array of float?
////////////////////////////////////////////////////////////////////////////////
typedef float __v4sf __attribute__( ( mode( V4SF ), aligned( 16 ) ) );
////////////////////////////////////////////////////////////////////////////////
struct v4sf
{
__v4sf v;
///
always_inline
v4sf( ) { }
always_inline
v4sf( __v4sf _1 )
: v( _1 ) { }
always_inline
v4sf( float const *_1 )
: v( __builtin_ia32_loadups( ( float * )_1 ) ) { }
always_inline
operator __v4sf( ) const { return v; }
};
struct float4
{
union { float v[4]; };
///
always_inline
float4( ) { }
always_inline
float4( __v4sf _1 )
{ __builtin_ia32_storeups( v, _1 ); }
always_inline
operator __v4sf( ) const { return v4sf( v ); }
};
...
float const f[4] = { 1.0, -1.0, 1.0, -1.0 };
__v4sf compute( __v4sf a, __v4sf b, __v4sf c )
{
return a * ( v4sf( b ) + c ) - f;
}
gives us :
000000a0 <__Z7computeU8__vectorfS_S_>:
a0: 0f 28 54 24 14 movaps 0x14(%esp,1),%xmm2
a5: 0f 28 44 24 04 movaps 0x4(%esp,1),%xmm0
aa: 0f 58 54 24 24 addps 0x24(%esp,1),%xmm2
af: 0f 59 c2 mulps %xmm2,%xmm0
b2: 0f 10 15 04 01 00 00 movups 0x104,%xmm2
b9: 0f 5c c2 subps %xmm2,%xmm0
bc: c3 ret
My mail address is paul.suade@laposte.net.
|
|
|
|

|
Hey Paul, thanks a lot. You have given me a lot to think about. I should mention that I am using gcc with Linux and not with Microsoft... BTW, does it integrate well with Dev-C++? I think I am going to be doing some Windows coding in the near future.
|
|
|
|

|
Dev-C++ is not so bad. It is an IDE for the MinGW32 GCC compiler, but you can indeed change the gcc compiler if Cygwin is needed.
I think it could be an ideal IDE for using GCC on both Win32 and Linux platforms.
But I don't think Dev-C++ can replace all the features of the VC IDE.
Happy coding.
P.S.: I got the xmmintrin.h/emmintrin.h files from CVS. But I'm still uncertain whether I must use them.
|
|
|
|

|
Excellent article and excellent examples. Now I'm looking for an example project using SSE2 intrinsics. I noticed that the Swarm project from MS says that it uses both MMX and SSE2, but when I download the project, it only contains the MMX code. Do you know of any example code for SSE2 that I can look over?
-Brett
|
|
|
|

|
As I remember, SSE2 extends both the MMX and SSE technologies. It allows you to work with double-precision numbers and has a set of MMX-like instructions that use the 128-bit SSE registers. The MMXSwarm sample contains such MMX-like SSE2 instructions. I believe you can find some information by searching for SSE2 with Google; this is one good link, for example:
http://www.intel.com/update/departments/software/sw03011.pdf
Vincent Leong from CodeGuru published an article about MMX:
http://www.codeguru.com/cpp_mfc/MMXDemo.html
and promised some SSE stuff in his next article.
|
|
|
|

 |
|
|
|
An article describes programming floating-point calculations using Streaming SIMD Extensions
| Type | Article |
| Licence | CPOL |
| First Posted | 10 Jul 2003 |
| Views | 368,453 |
| Bookmarked | 109 times |
|
|