# Inner Product Experiment: CPU, FPU vs. SSE*

By Chesnokov Yuriy, 20 May 2008

## Introduction

The inner product (also called the dot product or scalar product) is one of the core operations in digital signal processing. It appears everywhere: Fourier-related transforms (FFT, DCT), wavelet analysis, filtering and so on. With SSE you can parallelize this operation, multiplying and adding several numbers in a single instruction. But which precision should you choose: integers, floats or doubles? In this article I demonstrate the inner product on shorts, ints, floats and doubles, computed both with plain CPU/FPU code and with SSE/SSE2/SSE3 optimized versions.

## Background

You need a basic understanding of the inner product operation and of SSE technology. Wikipedia's article on the inner product is a good starting point. In short, you take two arrays (std::vector of floats, ints, shorts or doubles) of equal length, multiply them element-wise and sum the entries of the resulting vector, producing a single number. For SSE programming there is a nice article, Introduction to SSE Programming, on CodeProject.
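As a minimal sketch of the definition above, the scalar inner product can be written with the standard library's `std::inner_product`. The function name `dot_reference` is an illustrative choice and is not part of the article's source code.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Reference inner product: multiply element-wise, then sum the products.
// dot_reference is a hypothetical name used only for this illustration.
float dot_reference(const std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size());  // both vectors must have equal length
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0f);
}
```

For example, for the vectors (1, 2, 3) and (4, 5, 6) the result is 1·4 + 2·5 + 3·6 = 32. A reference like this is handy for checking the optimized versions against a known-good result.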

## Using the code

Run the console application and pass the array length as the first argument. It creates two vectors of that length with random entries and computes their inner product, printing the results and processing times for chars, shorts/shorts SSE2, ints, floats/floats SSE/floats SSE3 and doubles/doubles SSE2.

```
>inner.exe 5000000
chars          processing time: 13 ms
-119
shorts         processing time: 17 ms
3241
shorts sse2    processing time: 6 ms
3241
ints           processing time: 16 ms
786540
floats         processing time: 30 ms
1339854
sse2 intrin    processing time: 13 ms
1339854
sse3 assembly  processing time: 12 ms
1339854
doubles        processing time: 22 ms
1107822
doubles sse2   processing time: 25 ms
1107822
```

These are the results for the inner product of two vectors of size 5000000. The line after each timing shows the rounded result of the operation. I ran it on my AMD Turion 64 2.2 GHz processor. Shorts with SSE2 are the fastest variant (6 ms). The haddps instruction introduced in SSE3 gives slightly faster processing than the intrinsics version (12 ms versus 13 ms). Double precision on the FPU outperformed FPU floats (22 ms versus 30 ms). However, SSE2 optimization of doubles slowed the computation down (25 ms versus 22 ms), whereas for floats SSE gives more than a twofold speedup over the FPU.
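The speedups mentioned here follow directly from the timings in the sample run. The short sketch below only does that arithmetic; the helper names `speedup` and `print_speedups` are illustrative and do not come from the article's source.

```cpp
#include <cassert>
#include <cstdio>

// Speedup of an optimized run over a baseline run, both in milliseconds.
// Values above 1 mean the optimized version is faster.
double speedup(double baseline_ms, double optimized_ms)
{
    assert(optimized_ms > 0.0);
    return baseline_ms / optimized_ms;
}

// Prints the ratios for the timings listed in the sample run.
void print_speedups()
{
    std::printf("shorts  SSE2 vs CPU: %.2fx\n", speedup(17.0, 6.0));
    std::printf("floats  SSE  vs FPU: %.2fx\n", speedup(30.0, 13.0));
    std::printf("floats  SSE3 vs FPU: %.2fx\n", speedup(30.0, 12.0));
    std::printf("doubles SSE2 vs FPU: %.2fx\n", speedup(22.0, 25.0));
}
```

For this single run the ratios are roughly 2.83, 2.31, 2.50 and 0.88. Timings of a few milliseconds are noisy, so they are best treated as indicative.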

If you perform DSP on image data, use SSE2 with shorts (with integer filters) or floats. Do not SSE-optimize doubles; in this experiment that decreased performance. If you are not going to use SSE optimization, use double precision, which was faster than floats here.

I did not try it on other processors, so if your results are different, let me know. The code for the inner product operation is presented below.

On floats with FPU:

```
std::vector<float> v1;
std::vector<float> v2;
v1.resize(size);
v2.resize(size);
float* pv1 = &v1[0];
float* pv2 = &v2[0];

// Fill both vectors with random integers in [-32, 31]
for (unsigned int i = 0; i < size; i++) {
    pv1[i] = float(rand() % 64 - 32);
    pv2[i] = float(rand() % 64 - 32);
}

// Scalar (FPU) inner product
float sum = 0;
for (unsigned int i = 0; i < size; i++)
    sum += pv1[i] * pv2[i];
wprintf(L" %d\n", (int)sum);
```
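The "processing time" figures in the sample run come from timing loops like the one above. The timer the console application uses is not shown in the article, so the following is only a portable sketch based on `std::chrono`; `inner_scalar` and `time_inner_ms` are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Scalar inner product, the same loop as the FPU version.
float inner_scalar(const float* a, const float* b, std::size_t size)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < size; i++)
        sum += a[i] * b[i];
    return sum;
}

// Hypothetical helper: runs inner_scalar once and returns elapsed milliseconds.
// The result is written to *result so the compiler cannot discard the work.
long long time_inner_ms(const std::vector<float>& a,
                        const std::vector<float>& b,
                        float* result)
{
    assert(a.size() == b.size());
    const auto start = std::chrono::steady_clock::now();
    *result = inner_scalar(a.data(), b.data(), a.size());
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}
```

A single run of a few milliseconds is sensitive to cache state and scheduling, so repeating the measurement and taking the minimum or median is a common refinement.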

SSE2 optimized shorts:

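```
short sse2_inner_s(const short* p1, const short* p2, unsigned int size)
{
    const __m128i* mp1 = (const __m128i*)p1;
    const __m128i* mp2 = (const __m128i*)p2;
    __m128i mres = _mm_set_epi16(0, 0, 0, 0, 0, 0, 0, 0);

    for (unsigned int i = 0; i < size / 8; i++) {
        // Assumed loop body (absent from the listing as given):
        // multiply 8 shorts element-wise, keep the low 16 bits, accumulate
        __m128i mtmp = _mm_mullo_epi16(_mm_loadu_si128(mp1), _mm_loadu_si128(mp2));
        mres = _mm_add_epi16(mres, mtmp);
        mp1++;
        mp2++;
    }

    short res[8];
    __m128i* pmres = (__m128i*)res;
    _mm_storeu_si128(pmres, mres);

    // The 16-bit lanes wrap on overflow, and so does the returned short
    return res[0] + res[1] + res[2] + res[3] + res[4] + res[5] + res[6] + res[7];
}
```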
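Because `sse2_inner_s` returns a `short`, the sum wraps around for all but small inputs. One possible variation, not taken from the article, is to accumulate in 32-bit lanes with the SSE2 intrinsic `_mm_madd_epi16`. The function name `sse2_inner_s32` is hypothetical, and the sketch also adds a scalar tail for lengths that are not a multiple of 8.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical variant of sse2_inner_s with 32-bit accumulation.
int sse2_inner_s32(const short* p1, const short* p2, std::size_t size)
{
    assert(p1 != nullptr && p2 != nullptr);
    __m128i acc = _mm_setzero_si128();  // four 32-bit partial sums
    const std::size_t blocks = size / 8;

    for (std::size_t i = 0; i < blocks; i++) {
        const __m128i v1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p1 + 8 * i));
        const __m128i v2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p2 + 8 * i));
        // Multiplies 8 pairs of shorts and adds adjacent products into 4 ints.
        acc = _mm_add_epi32(acc, _mm_madd_epi16(v1, v2));
    }

    int lanes[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    int sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    // Scalar tail for the last size % 8 elements.
    for (std::size_t i = blocks * 8; i < size; i++)
        sum += p1[i] * p2[i];

    return sum;
}
```

The 32-bit accumulators can still overflow for very long vectors with large entries, so this widens the safe range rather than removing the limit. Whether it is faster or slower than the 16-bit version on a given processor would need to be measured.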

SSE optimized floats:

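```
float sse_inner(const float* a, const float* b, unsigned int size)
{
    float z = 0.0f, fres = 0.0f;
    __declspec(align(16)) float ftmp[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    __m128 mres;

    if ((size / 4) != 0) {
        // The accumulator setup, the loop body and the final add are assumed;
        // they are absent from the listing as given.
        mres = _mm_load_ss(&z);                     //0,0,0,0
        for (unsigned int i = 0; i < size / 4; i++)
            mres = _mm_add_ps(mres, _mm_mul_ps(_mm_loadu_ps(&a[4 * i]), _mm_loadu_ps(&b[4 * i])));

        //mres = a,b,c,d
        __m128 mv1 = _mm_movelh_ps(mres, mres);     //a,b,a,b
        __m128 mv2 = _mm_movehl_ps(mres, mres);     //c,d,c,d
        mres = _mm_add_ps(mv1, mv2);                //a+c,b+d,a+c,b+d

        _mm_store_ps(ftmp, mres);

        fres = ftmp[0] + ftmp[1];
    }

    // Scalar tail for the last size % 4 elements
    if ((size % 4) != 0) {
        for (unsigned int i = size - size % 4; i < size; i++)
            fres += a[i] * b[i];
    }

    return fres;
}
```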

SSE3 optimized floats:

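```
float sse3_inner(const float* a, const float* b, unsigned int size)
{
    float z = 0.0f, fres = 0.0f;

    if ((size / 4) != 0) {
        const float* pa = a;
        const float* pb = b;
        __asm {
            movss   xmm0, dword ptr[z]          // xmm0 = 0,0,0,0
        }
        for (unsigned int i = 0; i < size / 4; i++) {
            __asm {
                mov     eax, dword ptr[pa]
                mov     ebx, dword ptr[pb]
                movups  xmm1, [eax]
                movups  xmm2, [ebx]
                mulps   xmm1, xmm2
                addps   xmm0, xmm1              // assumed: accumulate the products
            }
            pa += 4;
            pb += 4;
        }
        __asm {
            haddps  xmm0, xmm0                  // assumed: two horizontal adds
            haddps  xmm0, xmm0                  // leave the total in every lane
            movss   dword ptr[fres], xmm0
        }
    }

    // Scalar tail for sizes not divisible by 4, as in sse_inner
    for (unsigned int i = size - size % 4; i < size; i++)
        fres += a[i] * b[i];

    return fres;
}
```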
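To make the role of `haddps` concrete, here is a portable scalar model of what the instruction computes: it adds adjacent pairs within each operand. This is only an illustration of the semantics, not SSE code, and the names `Vec4`, `hadd_model`, `horizontal_sum` and `hadd_demo` are made up for this sketch.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;

// Scalar model of haddps dst, src:
// result = { dst0+dst1, dst2+dst3, src0+src1, src2+src3 }
Vec4 hadd_model(const Vec4& dst, const Vec4& src)
{
    return { dst[0] + dst[1], dst[2] + dst[3], src[0] + src[1], src[2] + src[3] };
}

// Two horizontal adds of a register with itself leave the sum of all
// four lanes in every lane, so reading lane 0 gives the total.
float horizontal_sum(const Vec4& v)
{
    const Vec4 once = hadd_model(v, v);         // {v0+v1, v2+v3, v0+v1, v2+v3}
    const Vec4 twice = hadd_model(once, once);  // every lane: (v0+v1)+(v2+v3)
    return twice[0];
}

// Small self-check on concrete values: (1+2) + (3+4) = 10.
void hadd_demo()
{
    assert(horizontal_sum({1.0f, 2.0f, 3.0f, 4.0f}) == 10.0f);
}
```

This replaces the movelh/movehl shuffle, add, store and scalar add used in the SSE version with two instructions and a single `movss`.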

SSE2 optimized doubles:

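```
double sse_inner_d(const double* a, const double* b, unsigned int size)
{
    double z = 0.0, fres = 0.0;
    __declspec(align(16)) double ftmp[2] = { 0.0, 0.0 };
    __m128d mres;

    if ((size / 2) != 0) {
        // The accumulator setup and the loop body are assumed;
        // they are absent from the listing as given.
        mres = _mm_load_sd(&z);                     //0,0
        for (unsigned int i = 0; i < size / 2; i++)
            mres = _mm_add_pd(mres, _mm_mul_pd(_mm_loadu_pd(&a[2 * i]), _mm_loadu_pd(&b[2 * i])));

        _mm_store_pd(ftmp, mres);

        fres = ftmp[0] + ftmp[1];
    }

    // Scalar tail for an odd-length input
    if ((size % 2) != 0) {
        for (unsigned int i = size - size % 2; i < size; i++)
            fres += a[i] * b[i];
    }

    return fres;
}
```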

## About the Author

Chesnokov Yuriy, Engineer, Russian Federation

Former Cambridge University postdoc (http://www-ucc-old.ch.cam.ac.uk/research/yc274-research.html), Department of Chemistry, Unilever Centre for Molecular Informatics, where I worked on the problem of complexity analysis of cardiac data.

As a subsidiary result, we achieved 1st place in the annual PhysioNet/Computers in Cardiology Challenge 2006: QT Interval Measurement (http://physionet.org/challenge/2006/).

My research interests are: digital signal processing in medicine, image and video processing, pattern recognition, AI, computer vision.

My recent publications are:

Complexity and spectral analysis of the heart rate variability dynamics for distant prediction of paroxysmal atrial fibrillation with artificial intelligence methods. Artificial Intelligence in Medicine. 2008. V43/2. PP. 151-165 (http://dx.doi.org/10.1016/j.artmed.2008.03.009)

Face Detection C++ Library with Skin and Motion Analysis. Biometrics AIA 2007 TTS. 22 November 2007, Moscow, Russia. (http://www.dancom.ru/rus/AIA/2007TTS/ProgramAIA2007TTS.html)

Screening Patients with Paroxysmal Atrial Fibrillation (PAF) from Non-PAF Heart Rhythm Using HRV Data Analysis. Computers in Cardiology 2007. V. 34. PP. 459–463 (http://www.cinc.org/archives/2007/pdf/0459.pdf)

Distant Prediction of Paroxysmal Atrial Fibrillation Using HRV Data Analysis. Computers in Cardiology 2007. V. 34. PP. 455-459 (http://www.cinc.org/archives/2007/pdf/0455.pdf)

Individually Adaptable Automatic QT Detector. Computers in Cardiology 2006. V. 33. PP. 337-341 (http://www.cinc.org/archives/2006/pdf/0337.pdf)

Article Copyright 2007 by Chesnokov Yuriy