## Introduction

Intel's SSE intrinsics can boost the performance of floating point calculations. Both GCC and Microsoft Visual Studio support SSE intrinsics. SSE uses the xmm registers for floating point calculations: xmm0-xmm15 (16 registers) in 64-bit mode, or xmm0-xmm7 (8 registers) in 32-bit mode. SSE operations on single precision and double precision floating point numbers differ slightly. My objective is to point out those differences using a simple summation over a floating point array.

## SSE Programming

The SSE intrinsics and data types are declared in a family of headers. `#include <xmmintrin.h>` provides the original SSE single precision API, `#include <emmintrin.h>` adds the SSE2 double precision API, and `#include <pmmintrin.h>` adds the SSE3 horizontal adds. Each later header includes the earlier ones, so `pmmintrin.h` alone covers everything used in this article (with GCC, compile with `-msse3`). `__m128` holds four single precision floating point numbers and `__m128d` holds two double precision numbers. `_mm_load_pd` loads double precision numbers and `_mm_load_ps` loads single precision numbers. Similarly, `_mm_add_ps` and `_mm_hadd_ps` add single precision numbers, while `_mm_add_pd` and `_mm_hadd_pd` add double precision numbers. The floating point array passed to these load functions has to be aligned on a 16-byte boundary, which can be done by allocating it with `_mm_malloc`.
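As a minimal sketch of the alignment requirement, the helper below allocates a float array on a 16-byte boundary with `_mm_malloc`. The name `alloc_aligned_floats` is hypothetical and only for illustration; the sketch assumes GCC or Clang, where `xmmintrin.h` also makes `_mm_malloc` and `_mm_free` available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE; GCC and Clang also pull in _mm_malloc/_mm_free here */

/* Hypothetical helper: allocate n floats on a 16-byte boundary and set
 * every element to value. Returns NULL if the allocation fails. */
static float *alloc_aligned_floats(size_t n, float value)
{
    float *a = (float *)_mm_malloc(n * sizeof(float), 16);
    if (a == NULL)
        return NULL;

    /* _mm_load_ps requires this; _mm_malloc was asked to guarantee it. */
    assert(((uintptr_t)a % 16) == 0);

    for (size_t i = 0; i < n; i++)
        a[i] = value;
    return a;
}
```

Memory obtained this way should be released with `_mm_free(a)` rather than `free(a)`.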

`_mm_add_ps` adds the four single precision floating-point values:

    r0 := a0 + b0
    r1 := a1 + b1
    r2 := a2 + b2
    r3 := a3 + b3

`_mm_add_pd` adds the two double precision floating-point values:

    r0 := a0 + b0
    r1 := a1 + b1
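The lane-by-lane behaviour can be checked with a small sketch. The helper names `add4_floats` and `add2_doubles` are hypothetical; they use the unaligned load and store variants (`_mm_loadu_ps`, `_mm_storeu_ps`, `_mm_loadu_pd`, `_mm_storeu_pd`) so that ordinary arrays work without special alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h> /* SSE2; also includes xmmintrin.h for the _ps functions */

/* Hypothetical helper: out[k] = a[k] + b[k] for k = 0..3 (four floats at once). */
static void add4_floats(const float *a, const float *b, float *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    __m128 va = _mm_loadu_ps(a);             /* a0 a1 a2 a3 */
    __m128 vb = _mm_loadu_ps(b);             /* b0 b1 b2 b3 */
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* r0 r1 r2 r3 */
}

/* Hypothetical helper: out[k] = a[k] + b[k] for k = 0..1 (two doubles at once). */
static void add2_doubles(const double *a, const double *b, double *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    __m128d va = _mm_loadu_pd(a);            /* a0 a1 */
    __m128d vb = _mm_loadu_pd(b);            /* b0 b1 */
    _mm_storeu_pd(out, _mm_add_pd(va, vb));  /* r0 r1 */
}
```

The same 128-bit register holds either four floats or two doubles, which is why one `_mm_add_pd` does half as many additions as one `_mm_add_ps`.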

## Code

This is the plain C code that we wish to convert to SSE:

    float sum = 0;
    for (int i = 0; i < n; i++) {
        sum += scores[i];
    }

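Sample code for single precision floating point addition:

    /* a must be 16-byte aligned and n must be a multiple of 4 */
    float sum = 0.0f;
    __m128 rsum = _mm_set1_ps(0.0f);       /* four partial sums, all zero */
    for (int i = 0; i < n; i += 4)
    {
        __m128 mr = _mm_load_ps(&a[i]);    /* load a[i] .. a[i+3] */
        rsum = _mm_add_ps(rsum, mr);       /* add lane by lane */
    }
    rsum = _mm_hadd_ps(rsum, rsum);        /* (r0+r1, r2+r3, r0+r1, r2+r3) */
    rsum = _mm_hadd_ps(rsum, rsum);        /* every lane now holds r0+r1+r2+r3 */
    _mm_store_ss(&sum, rsum);              /* copy the lowest lane into sum */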
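The loop above reads four floats per step, so it assumes `n` is a multiple of 4. A minimal sketch of one way to handle any `n` is shown below, assuming the array is still 16-byte aligned. The function name `sum_floats_sse` is hypothetical. Note that it swaps the `_mm_hadd_ps` reduction for a store to a small temporary array, which is a different technique from the one in the article and needs only SSE rather than SSE3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE only */

/* Hypothetical helper: sum n floats. a must be 16-byte aligned; n may be any size. */
static float sum_floats_sse(const float *a, size_t n)
{
    assert(((uintptr_t)a % 16) == 0);

    __m128 acc = _mm_setzero_ps();         /* four partial sums */
    size_t i = 0;
    for (; i + 4 <= n; i += 4)             /* full groups of four */
        acc = _mm_add_ps(acc, _mm_load_ps(&a[i]));

    float lanes[4];
    _mm_storeu_ps(lanes, acc);             /* spill the partial sums */
    float sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    for (; i < n; i++)                     /* 0 to 3 leftover elements */
        sum += a[i];
    return sum;
}
```

Because the additions happen in a different order than in the plain loop, the result can differ from the scalar sum in the last bits for general floating point data.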

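Sample code for double precision floating point addition:

    /* a must be 16-byte aligned and n must be a multiple of 4 */
    double sum = 0.0;
    __m128d rsum = _mm_set1_pd(0.0);           /* partial sums of a[i], a[i+1] */
    __m128d rsum1 = _mm_set1_pd(0.0);          /* partial sums of a[i+2], a[i+3] */
    for (int i = 0; i < n; i += 4)
    {
        __m128d mr = _mm_load_pd(&a[i]);       /* load a[i], a[i+1] */
        __m128d mr1 = _mm_load_pd(&a[i+2]);    /* load a[i+2], a[i+3] */
        rsum = _mm_add_pd(rsum, mr);
        rsum1 = _mm_add_pd(rsum1, mr1);
    }
    rsum = _mm_hadd_pd(rsum, rsum1);           /* (sum of rsum lanes, sum of rsum1 lanes) */
    rsum = _mm_hadd_pd(rsum, rsum);            /* both lanes now hold the total */
    _mm_store_sd(&sum, rsum);                  /* copy the lowest lane into sum */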
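As with the single precision version, this loop assumes `n` is a multiple of 4. The sketch below keeps the same two-accumulator pattern but handles leftover elements. The name `sum_doubles_sse2` is hypothetical, and the final reduction stores the lanes to a temporary array instead of using `_mm_hadd_pd`, so it needs only SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2: __m128d and the _pd functions */

/* Hypothetical helper: sum n doubles. a must be 16-byte aligned; n may be any size. */
static double sum_doubles_sse2(const double *a, size_t n)
{
    assert(((uintptr_t)a % 16) == 0);

    __m128d acc0 = _mm_setzero_pd();       /* partial sums of a[i],   a[i+1] */
    __m128d acc1 = _mm_setzero_pd();       /* partial sums of a[i+2], a[i+3] */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 = _mm_add_pd(acc0, _mm_load_pd(&a[i]));
        acc1 = _mm_add_pd(acc1, _mm_load_pd(&a[i + 2]));
    }

    double lanes[2];
    _mm_storeu_pd(lanes, _mm_add_pd(acc0, acc1)); /* merge both accumulators */
    double sum = lanes[0] + lanes[1];

    for (; i < n; i++)                     /* 0 to 3 leftover elements */
        sum += a[i];
    return sum;
}
```

Since `i` advances by 4 doubles (32 bytes) per step, both `&a[i]` and `&a[i + 2]` stay on 16-byte boundaries whenever `a` itself is aligned.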

The difference between the two is that with single precision you add 4 values in one operation:

    rsum = _mm_add_ps(rsum, mr);

With double precision you add only 2 values in one operation, so you need two operations for 4 values:

    rsum = _mm_add_pd(rsum, mr);
    rsum1 = _mm_add_pd(rsum1, mr1);

Adding a timer shows that the SSE code is much faster than the plain code. On my PC, the SSE code was almost 4 times faster than the plain code.
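One way to set up such a timer is sketched below using `clock()` from the C standard library. All function names here (`sum_plain`, `sum_sse`, `time_sum`) are hypothetical, and this is not necessarily how the measurement above was taken. The measured ratio depends on the compiler, the optimization flags, and the array size, so the sketch makes no claim about the speedup you will see.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#include <xmmintrin.h> /* SSE only */

/* Plain scalar sum, the baseline. */
static float sum_plain(const float *a, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* SSE sum; assumes a is 16-byte aligned and n is a multiple of 4. */
static float sum_sse(const float *a, size_t n)
{
    assert(((uintptr_t)a % 16) == 0 && n % 4 == 0);
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(&a[i]));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}

/* Run fn(a, n) reps times; return the CPU seconds used and the last result.
 * The call goes through a volatile function pointer so the compiler has to
 * perform every call instead of folding the repetitions into one. */
static double time_sum(float (*fn)(const float *, size_t),
                       const float *a, size_t n, int reps, float *result)
{
    float (*volatile call)(const float *, size_t) = fn;
    float last = 0.0f;
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        last = call(a, n);
    clock_t end = clock();
    *result = last;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_sum(sum_plain, a, n, reps, &r1)` and `time_sum(sum_sse, a, n, reps, &r2)` on the same aligned array gives two timings to compare. Use a large `n` and enough repetitions that the elapsed time is well above the resolution of `clock()`.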

Hence, SSE instructions can be used to build faster applications where execution time matters.

## Last of All

This is my first post on CodeProject. There may be mistakes in this article; please let me know and give me feedback.