Single precision and double precision floating point operations in SSE optimization


Introduction

Intel's SSE intrinsics boost the performance of floating point calculations, and both GCC and Microsoft Visual Studio support them. SSE floating point calculations use the xmm registers: xmm0-xmm15 (16 registers) in 64-bit mode, or xmm0-xmm7 (8 registers) in 32-bit mode. Operations on single precision and double precision floating point values differ slightly in SSE. My objective is to point out these differences using a simple operation: summing a floating point array.

SSE Programming

The single precision SSE data types and intrinsics are declared in <xmmintrin.h>; the double precision (SSE2) ones come from <emmintrin.h>, and the horizontal adds used below, _mm_hadd_ps and _mm_hadd_pd, are SSE3 instructions declared in <pmmintrin.h>. __m128 holds four single precision floating point numbers, while __m128d holds two double precision numbers. _mm_load_ps loads single precision values and _mm_load_pd loads double precision values. Similarly, _mm_add_ps and _mm_hadd_ps add single precision values, while _mm_add_pd and _mm_hadd_pd add double precision values. The floating point array has to be 16-byte aligned, which can be done with _mm_malloc.
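
For example, a buffer for the samples below could be allocated like this (a minimal sketch; the array name a and the size n are placeholders, not part of the article's code):

C++
#include <xmmintrin.h>   // _mm_malloc/_mm_free come via this header on GCC
#include <malloc.h>      // ...and via this one on MSVC

int main()
{
    const int n = 1024;
    // 16-byte alignment is required by _mm_load_ps and _mm_load_pd
    float* a = (float*)_mm_malloc(n * sizeof(float), 16);
    if (a == 0)
        return 1;

    for (int i = 0; i < n; i++)
        a[i] = 1.0f;             // fill with some test data

    // ... the SSE summation code from the samples below goes here ...

    _mm_free(a);                 // memory from _mm_malloc must be freed with _mm_free
    return 0;
}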

_mm_add_ps adds the four single precision floating-point values

r0 := a0 + b0
r1 := a1 + b1
r2 := a2 + b2
r3 := a3 + b3

_mm_add_pd adds the two double precision floating-point values

r0 := a0 + b0
r1 := a1 + b1
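
For illustration, the lane-wise behaviour described above can be checked with a small test program (a minimal sketch using the set and store intrinsics; it is not part of the summation samples below):

C++
#include <emmintrin.h>   // SSE2: __m128 and __m128d plus their add intrinsics
#include <cstdio>

int main()
{
    // _mm_set_ps takes its arguments from the highest lane to the lowest,
    // so this builds a = (1, 2, 3, 4) and b = (10, 20, 30, 40).
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
    __m128 r = _mm_add_ps(a, b);          // r = (11, 22, 33, 44)

    float rf[4];
    _mm_storeu_ps(rf, r);                 // unaligned store of the four lanes
    printf("%g %g %g %g\n", rf[0], rf[1], rf[2], rf[3]);

    __m128d c = _mm_set_pd(2.0, 1.0);     // c = (1, 2)
    __m128d d = _mm_set_pd(20.0, 10.0);   // d = (10, 20)
    __m128d s = _mm_add_pd(c, d);         // s = (11, 22)

    double rd[2];
    _mm_storeu_pd(rd, s);
    printf("%g %g\n", rd[0], rd[1]);
    return 0;
}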

Code

This is the plain C code that we wish to convert to SSE:

float sum = 0;  // for double precision: double sum = 0;
for (int i = 0; i < n; i++) {
    sum += a[i];
}

Single precision floating point addition sample code:

C++
float sum = 0.0f;
__m128 rsum = _mm_set1_ps(0.0f);          // four partial sums, all zero
for (int i = 0; i < n; i += 4)            // assumes n is a multiple of 4
{
    __m128 mr = _mm_load_ps(&a[i]);       // load 4 floats (a must be 16-byte aligned)
    rsum = _mm_add_ps(rsum, mr);          // accumulate 4 partial sums in parallel
}
rsum = _mm_hadd_ps(rsum, rsum);           // horizontal add: 4 partial sums -> 2
rsum = _mm_hadd_ps(rsum, rsum);           // 2 partial sums -> 1
_mm_store_ss(&sum, rsum);                 // store the lowest lane into sum
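
The loop above assumes that n is a multiple of 4. One simple way to handle an arbitrary length, sketched below as a stand-alone function, is to process full blocks of 4 with SSE and finish the remaining elements with a scalar loop:

C++
#include <pmmintrin.h>   // SSE3, for _mm_hadd_ps (compile with -msse3 on GCC)

// Sum n floats; a must be 16-byte aligned, but n may be any length.
float sumSSE(const float* a, int n)
{
    __m128 rsum = _mm_set1_ps(0.0f);

    int i = 0;
    for (; i + 4 <= n; i += 4)            // full blocks of 4 elements
        rsum = _mm_add_ps(rsum, _mm_load_ps(&a[i]));

    rsum = _mm_hadd_ps(rsum, rsum);       // 4 partial sums -> 2
    rsum = _mm_hadd_ps(rsum, rsum);       // 2 partial sums -> 1

    float sum;
    _mm_store_ss(&sum, rsum);

    for (; i < n; i++)                    // scalar tail: the remaining 0-3 elements
        sum += a[i];
    return sum;
}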

Double precision floating point addition sample code:

C++
double sum = 0.0;
__m128d rsum = _mm_set1_pd(0.0);          // two partial sums
__m128d rsum1 = _mm_set1_pd(0.0);         // two more partial sums
for (int i = 0; i < n; i += 4)            // assumes n is a multiple of 4
{
    __m128d mr = _mm_load_pd(&a[i]);      // load 2 doubles (a must be 16-byte aligned)
    __m128d mr1 = _mm_load_pd(&a[i + 2]); // load the next 2 doubles
    rsum = _mm_add_pd(rsum, mr);
    rsum1 = _mm_add_pd(rsum1, mr1);
}
rsum = _mm_hadd_pd(rsum, rsum1);          // horizontal add: 4 partial sums -> 2
rsum = _mm_hadd_pd(rsum, rsum);           // 2 partial sums -> 1
_mm_store_sd(&sum, rsum);                 // store the lowest lane into sum

The difference between the single precision and double precision versions is that with single precision floating point numbers you can add four values in one operation:

C++
rsum = _mm_add_ps(rsum, mr);

With double precision you can add only two values in one operation, so you need two operations to process four values:

C++
rsum = _mm_add_pd(rsum, mr);
rsum1 = _mm_add_pd(rsum1, mr1);

If you add a timer, you can see that the SSE code is much faster than the plain code. On my PC, I observed that the SSE code was almost 4 times faster than the plain version.
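
A minimal timing sketch is shown below; sumPlain and sumSSE simply wrap the loops shown above (the SSE version assumes n is a multiple of 4), and the exact speedup will depend on the compiler, the optimization level and the machine:

C++
// Compile with SSE3 enabled (e.g. -msse3 on GCC).
#include <pmmintrin.h>   // SSE3 intrinsics (_mm_hadd_ps)
#include <malloc.h>      // _mm_malloc/_mm_free on MSVC; GCC gets them via xmmintrin.h
#include <cstdio>
#include <ctime>

static float sumPlain(const float* a, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

static float sumSSE(const float* a, int n)   // assumes n % 4 == 0, a 16-byte aligned
{
    __m128 rsum = _mm_set1_ps(0.0f);
    for (int i = 0; i < n; i += 4)
        rsum = _mm_add_ps(rsum, _mm_load_ps(&a[i]));
    rsum = _mm_hadd_ps(rsum, rsum);
    rsum = _mm_hadd_ps(rsum, rsum);
    float sum;
    _mm_store_ss(&sum, rsum);
    return sum;
}

int main()
{
    const int n = 4096, repeats = 100000;
    float* a = (float*)_mm_malloc(n * sizeof(float), 16);
    for (int i = 0; i < n; i++)
        a[i] = 1.0f;

    volatile float keep = 0.0f;   // keeps the compiler from optimizing the loops away

    clock_t t0 = clock();
    for (int r = 0; r < repeats; r++)
        keep = sumPlain(a, n);
    clock_t t1 = clock();
    for (int r = 0; r < repeats; r++)
        keep = sumSSE(a, n);
    clock_t t2 = clock();

    printf("sum = %g, plain: %ld ticks, SSE: %ld ticks\n",
           (double)keep, (long)(t1 - t0), (long)(t2 - t1));
    _mm_free(a);
    return 0;
}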

Hence, using SSE instructions one can speed up complex applications where execution time matters.

Last of All

This is my first post on CodeProject. There may be mistakes in this article; please let me know and give me feedback.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



Comments and Discussions

 
YvesDaoust (3-Dec-12):
In the past I also tended to process remainder elements sequentially; the more it goes, the more I try to process everything by SSE blocks, masking out the unwanted elements. In the case of the sum, I would pad the array with zeroes (if possible), which don't disturb the sum.

The reason is that the sequential remainder can be costly in the case of small vectors: 23 elements are added in just 5 SSE additions, plus... 3 sequential ones. (Unfortunately, padding also has a cost.) This effect is quite noticeable when working with 16-byte blocks: for instance, if processing 255 bytes, half of the time would be spent processing the final 15 ones. Yes, half.

In my opinion it is important to let the reader understand that working with vector instructions like SSE introduces new constraints and special post-processing steps, like handling remainder elements and regrouping the results. These steps often account for the largest part of the code, and this is where you spend time programming and debugging.

