
Single precision and double precision floating point operations in SSE optimization

Nov 27, 2012

CPOL

2 min read


Introduction

The Intel SSE intrinsics boost the performance of floating point calculations. Both GCC and Microsoft Visual Studio support SSE intrinsics. SSE performs its floating point calculations in the xmm registers: xmm0-xmm15 (16 registers) on a 64-bit operating system, or xmm0-xmm7 (8 registers) on a 32-bit operating system. Operations on single precision and double precision floating point values are slightly different in SSE. My objective is to point out the differences between the two data types using a simple summation over a floating point array.

SSE Programming

The SSE data types and intrinsics are declared in #include <xmmintrin.h> (SSE), <emmintrin.h> (SSE2, which adds the double precision support), and <pmmintrin.h> (SSE3, which adds the horizontal adds). __m128 holds four single precision floating point numbers and __m128d holds two double precision numbers. _mm_load_ps loads single precision values and _mm_load_pd loads double precision values. Similarly, _mm_add_ps and _mm_hadd_ps add single precision values, while _mm_add_pd and _mm_hadd_pd add double precision values. The floating point array has to be 16-byte aligned, which can be done using _mm_malloc.
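
For example, a 16-byte aligned array can be allocated and released like this (a minimal sketch; the size 1024 is arbitrary):

#include <xmmintrin.h>   /* _mm_malloc/_mm_free (MSVC may also need <malloc.h>) */

int    n      = 1024;
float *scores = (float *)_mm_malloc(n * sizeof(float), 16);  /* 16-byte aligned */
if (scores != NULL)
{
    /* ... fill and process the array ... */
    _mm_free(scores);    /* memory from _mm_malloc must be freed with _mm_free */
}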

_mm_add_ps adds the four single precision floating-point values:

r0 := a0 + b0
r1 := a1 + b1
r2 := a2 + b2
r3 := a3 + b3

_mm_add_pd adds the two double precision floating-point values:

r0 := a0 + b0
r1 := a1 + b1
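
As a quick illustration, the following sketch (with arbitrary values) performs both additions and prints the resulting lanes:

#include <stdio.h>
#include <xmmintrin.h>   /* __m128,  _mm_set_ps, _mm_add_ps  */
#include <emmintrin.h>   /* __m128d, _mm_set_pd, _mm_add_pd  */

int main(void)
{
    /* four packed single precision values; note that _mm_set_ps lists
       them from the highest lane down to the lowest                    */
    __m128  a  = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128  b  = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
    __m128  r  = _mm_add_ps(a, b);           /* r  = {11, 22, 33, 44}  */

    /* two packed double precision values                               */
    __m128d c  = _mm_set_pd(2.0, 1.0);
    __m128d d  = _mm_set_pd(20.0, 10.0);
    __m128d rd = _mm_add_pd(c, d);           /* rd = {11, 22}          */

    float  rf[4];
    double rg[2];
    _mm_storeu_ps(rf, r);
    _mm_storeu_pd(rg, rd);
    printf("%f %f %f %f\n", rf[0], rf[1], rf[2], rf[3]);
    printf("%f %f\n", rg[0], rg[1]);
    return 0;
}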

Code

This is the plain C code that we wish to convert to SSE:

float sum = 0;  //for double precision: double sum = 0;   
for (int i = 0; i < n; i++) {
    sum += scores[i];            
}

Single precision floating point addition sample code:

float sum = 0.0f;
__m128 rsum = _mm_set1_ps(0.0f);        /* four partial sums, all zero        */
for (int i = 0; i < n; i += 4)          /* assumes n is a multiple of 4 and   */
{                                       /* that a is 16-byte aligned          */
    __m128 mr = _mm_load_ps(&a[i]);     /* load four floats                   */
    rsum = _mm_add_ps(rsum, mr);        /* accumulate four partial sums       */
}
rsum = _mm_hadd_ps(rsum, rsum);         /* horizontal add (SSE3)              */
rsum = _mm_hadd_ps(rsum, rsum);         /* every lane now holds the total     */
_mm_store_ss(&sum, rsum);               /* store the lowest lane into sum     */
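
If n is not a multiple of 4, the vectorised loop can only run over the largest multiple of 4 and the leftover elements have to be added in a scalar tail. A minimal sketch (the variable name n4 is just illustrative):

int n4 = n - (n % 4);              /* largest multiple of 4 not exceeding n   */
/* ... run the vectorised loop above for i = 0 .. n4-1 ... */
for (int i = n4; i < n; i++)       /* scalar tail for the remaining elements  */
    sum += a[i];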

Double precision floating point addition sample code:

double sum = 0.0;
__m128d rsum  = _mm_set1_pd(0.0);       /* first pair of partial sums         */
__m128d rsum1 = _mm_set1_pd(0.0);       /* second pair of partial sums        */
for (int i = 0; i < n; i += 4)          /* assumes n is a multiple of 4 and   */
{                                       /* that a is 16-byte aligned          */
    __m128d mr  = _mm_load_pd(&a[i]);   /* load two doubles                   */
    __m128d mr1 = _mm_load_pd(&a[i+2]); /* load the next two doubles          */
    rsum  = _mm_add_pd(rsum, mr);
    rsum1 = _mm_add_pd(rsum1, mr1);
}
rsum = _mm_hadd_pd(rsum, rsum1);        /* combine the two accumulators       */
rsum = _mm_hadd_pd(rsum, rsum);         /* both lanes now hold the total      */
_mm_store_sd(&sum, rsum);               /* store the lowest lane into sum     */
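
For reference, here is one way the double precision version could be wrapped into a reusable function. This is only a sketch under the same assumptions (16-byte aligned input), with a scalar tail for any leftover elements; the name sse_sum_pd is chosen just for illustration:

#include <emmintrin.h>   /* SSE2: __m128d, _mm_add_pd */
#include <pmmintrin.h>   /* SSE3: _mm_hadd_pd         */

double sse_sum_pd(const double *a, int n)   /* a must be 16-byte aligned */
{
    double  sum   = 0.0;
    __m128d rsum  = _mm_set1_pd(0.0);
    __m128d rsum1 = _mm_set1_pd(0.0);
    int     n4    = n - (n % 4);

    for (int i = 0; i < n4; i += 4)
    {
        rsum  = _mm_add_pd(rsum,  _mm_load_pd(&a[i]));
        rsum1 = _mm_add_pd(rsum1, _mm_load_pd(&a[i + 2]));
    }
    rsum = _mm_hadd_pd(rsum, rsum1);   /* combine the two accumulators */
    rsum = _mm_hadd_pd(rsum, rsum);    /* both lanes hold the total    */
    _mm_store_sd(&sum, rsum);

    for (int i = n4; i < n; i++)       /* scalar tail                  */
        sum += a[i];
    return sum;
}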

You can see that the difference between single precision and double precision is that a single precision operation adds four values at once:

rsum = _mm_add_ps(rsum, mr);

whereas a double precision operation adds only two values at once, so you need two operations for four values:

rsum = _mm_add_pd(rsum, mr);
rsum1 = _mm_add_pd(rsum1, mr1);

Adding a timer, you can see that the SSE code is much faster than the plain code. On my PC the SSE version was almost 4 times faster than the plain version.
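
Below is a minimal, self-contained sketch of such a timing comparison using the standard clock() function and the single precision loop from above; the measured ratio will vary with compiler flags (for example, whether the plain loop is auto-vectorised) and hardware:

#include <stdio.h>
#include <time.h>
#include <xmmintrin.h>   /* SSE: __m128, _mm_malloc (MSVC may also need <malloc.h>) */
#include <pmmintrin.h>   /* SSE3: _mm_hadd_ps                                       */

int main(void)
{
    int    n = 1 << 20;                                    /* 1M floats, a multiple of 4 */
    float *a = (float *)_mm_malloc(n * sizeof(float), 16); /* 16-byte aligned            */
    for (int i = 0; i < n; i++) a[i] = 1.0f;

    clock_t t0 = clock();
    float plain = 0.0f;
    for (int i = 0; i < n; i++)          /* plain C loop               */
        plain += a[i];
    clock_t t1 = clock();

    float sum = 0.0f;                    /* SSE loop from the section above */
    __m128 rsum = _mm_set1_ps(0.0f);
    for (int i = 0; i < n; i += 4)
        rsum = _mm_add_ps(rsum, _mm_load_ps(&a[i]));
    rsum = _mm_hadd_ps(rsum, rsum);
    rsum = _mm_hadd_ps(rsum, rsum);
    _mm_store_ss(&sum, rsum);
    clock_t t2 = clock();

    printf("plain: %ld ticks (%f)  SSE: %ld ticks (%f)\n",
           (long)(t1 - t0), plain, (long)(t2 - t1), sum);
    _mm_free(a);
    return 0;
}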

Hence, using SSE instructions one can develop faster, more complex applications where time optimization is required.

Last of All

This is my first post on CodeProject. There may be mistakes in this article; please let me know and give me feedback.